Step 1: Clean and prepare your data: There are several entries where values have been deleted to simulate dirty data. Please clean the data with whatever method(s) you believe is best/most suitable. Note that some of the missing values are truly blank (unknown answers) and thus may be impossible to clean; use your discretion.
Step 2: Build your models: Please build machine learning/statistical models in Python to predict the interest rate assigned to a loan. When writing the code associated with each model, please have the first part produce and save the model, followed by a second part that loads and applies the model.
Step 3: Test your models using the data found within the "Holdout for Testing" file. Save the results of the final model (remember you will only predict the first column in holdout test set with your best model results) in a single, separate CSV titled "Results from" *insert your name or UChicago net ID.
Step 4: Submit your work: Please submit all of your code for cleaning, prepping, and modeling your data, your "Results" file, a brief write-up comparing the pros and cons of the modeling techniques you used (no more than a paragraph). Your work will be scored on techniques used (appropriateness and complexity), model performance - measured by RMSE - on the data hold out, an understanding of the techniques you compared in your write-up, and your overall code.
%matplotlib inline
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
import plotly.express as px
from sklearn import metrics
# Importing Regressor Modules
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error as MSE
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as MSE
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)
loan_data = pd.read_csv('Data for Cleaning & Modeling.csv')
loan_data.head()
/var/folders/g6/gybbdfkx34769rq_nbzlxb0m0000gn/T/ipykernel_20382/1970426666.py:1: DtypeWarning: Columns (15) have mixed types. Specify dtype option on import or set low_memory=False.
loan_data = pd.read_csv('Data for Cleaning & Modeling.csv')
| X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 | X10 | X11 | X12 | X13 | X14 | X15 | X16 | X17 | X18 | X19 | X20 | X21 | X22 | X23 | X24 | X25 | X26 | X27 | X28 | X29 | X30 | X31 | X32 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11.89% | 54734.0 | 80364.0 | $25,000 | $25,000 | $19,080 | 36 months | B | B4 | NaN | < 1 year | RENT | 85000.0 | VERIFIED - income | Aug-09 | Due to a lack of personal finance education an... | debt_consolidation | Debt consolidation for on-time payer | 941xx | CA | 19.48 | 0.0 | Feb-94 | 0.0 | NaN | NaN | 10.0 | 0.0 | 28854.0 | 52.10% | 42.0 | f |
| 1 | 10.71% | 55742.0 | 114426.0 | $7,000 | $7,000 | $673 | 36 months | B | B5 | CNN | < 1 year | RENT | 65000.0 | not verified | May-08 | Just want to pay off the last bit of credit ca... | credit_card | Credit Card payoff | 112xx | NY | 14.29 | 0.0 | Oct-00 | 0.0 | NaN | NaN | 7.0 | 0.0 | 33623.0 | 76.70% | 7.0 | f |
| 2 | 16.99% | 57167.0 | 137225.0 | $25,000 | $25,000 | $24,725 | 36 months | D | D3 | Web Programmer | 1 year | RENT | 70000.0 | VERIFIED - income | Aug-14 | Trying to pay a friend back for apartment brok... | debt_consolidation | mlue | 100xx | NY | 10.50 | 0.0 | Jun-00 | 0.0 | 41.0 | NaN | 10.0 | 0.0 | 19878.0 | 66.30% | 17.0 | f |
| 3 | 13.11% | 57245.0 | 138150.0 | $1,200 | $1,200 | $1,200 | 36 months | C | C2 | city of beaumont texas | 10+ years | OWN | 54000.0 | not verified | Mar-10 | If funded, I would use this loan consolidate t... | debt_consolidation | zxcvb | 777xx | TX | 5.47 | 0.0 | Jan-85 | 0.0 | 64.0 | NaN | 5.0 | 0.0 | 2584.0 | 40.40% | 31.0 | f |
| 4 | 13.57% | 57416.0 | 139635.0 | $10,800 | $10,800 | $10,692 | 36 months | C | C3 | State Farm Insurance | 6 years | RENT | 32000.0 | not verified | Nov-09 | I currently have a personal loan with Citifina... | debt_consolidation | Nicolechr1978 | 067xx | CT | 11.63 | 0.0 | Dec-96 | 1.0 | 58.0 | NaN | 14.0 | 0.0 | 3511.0 | 25.60% | 40.0 | f |
loan_data.columns = ['interest_rate',
'loan_id',
'borrower_id',
'loan_requested',
'loan_funded',
'investor_funded',
'payment_numbers',
'loan_grade',
'loan_subgrade',
'job_title',
'years_employed',
'home_ownership',
'annual_income',
'income_verification',
'issued_date',
'loan_reason',
'loan_category',
'loan_title',
'zip_code',
'state',
'debt_payment_ratio',
'delinquency_frequency',
'credit_earliest_date',
'creditor_inquiries',
'borrower_delinquency_recency',
'public_record_recency',
'open_credit_line_count',
'derogatory_public_record_count',
'total_credit',
'credit_utilization_rate',
'total_credit_line_count',
'initial_status'
]
loan_data.head()
| interest_rate | loan_id | borrower_id | loan_requested | loan_funded | investor_funded | payment_numbers | loan_grade | loan_subgrade | job_title | years_employed | home_ownership | annual_income | income_verification | issued_date | loan_reason | loan_category | loan_title | zip_code | state | debt_payment_ratio | delinquency_frequency | credit_earliest_date | creditor_inquiries | borrower_delinquency_recency | public_record_recency | open_credit_line_count | derogatory_public_record_count | total_credit | credit_utilization_rate | total_credit_line_count | initial_status | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11.89% | 54734.0 | 80364.0 | $25,000 | $25,000 | $19,080 | 36 months | B | B4 | NaN | < 1 year | RENT | 85000.0 | VERIFIED - income | Aug-09 | Due to a lack of personal finance education an... | debt_consolidation | Debt consolidation for on-time payer | 941xx | CA | 19.48 | 0.0 | Feb-94 | 0.0 | NaN | NaN | 10.0 | 0.0 | 28854.0 | 52.10% | 42.0 | f |
| 1 | 10.71% | 55742.0 | 114426.0 | $7,000 | $7,000 | $673 | 36 months | B | B5 | CNN | < 1 year | RENT | 65000.0 | not verified | May-08 | Just want to pay off the last bit of credit ca... | credit_card | Credit Card payoff | 112xx | NY | 14.29 | 0.0 | Oct-00 | 0.0 | NaN | NaN | 7.0 | 0.0 | 33623.0 | 76.70% | 7.0 | f |
| 2 | 16.99% | 57167.0 | 137225.0 | $25,000 | $25,000 | $24,725 | 36 months | D | D3 | Web Programmer | 1 year | RENT | 70000.0 | VERIFIED - income | Aug-14 | Trying to pay a friend back for apartment brok... | debt_consolidation | mlue | 100xx | NY | 10.50 | 0.0 | Jun-00 | 0.0 | 41.0 | NaN | 10.0 | 0.0 | 19878.0 | 66.30% | 17.0 | f |
| 3 | 13.11% | 57245.0 | 138150.0 | $1,200 | $1,200 | $1,200 | 36 months | C | C2 | city of beaumont texas | 10+ years | OWN | 54000.0 | not verified | Mar-10 | If funded, I would use this loan consolidate t... | debt_consolidation | zxcvb | 777xx | TX | 5.47 | 0.0 | Jan-85 | 0.0 | 64.0 | NaN | 5.0 | 0.0 | 2584.0 | 40.40% | 31.0 | f |
| 4 | 13.57% | 57416.0 | 139635.0 | $10,800 | $10,800 | $10,692 | 36 months | C | C3 | State Farm Insurance | 6 years | RENT | 32000.0 | not verified | Nov-09 | I currently have a personal loan with Citifina... | debt_consolidation | Nicolechr1978 | 067xx | CT | 11.63 | 0.0 | Dec-96 | 1.0 | 58.0 | NaN | 14.0 | 0.0 | 3511.0 | 25.60% | 40.0 | f |
loan_data.shape
(400000, 32)
na_value = loan_data.isna().sum()
na_value
interest_rate 61010 loan_id 1 borrower_id 1 loan_requested 1 loan_funded 1 investor_funded 1 payment_numbers 1 loan_grade 61270 loan_subgrade 61270 job_title 23986 years_employed 17538 home_ownership 61361 annual_income 61028 income_verification 1 issued_date 1 loan_reason 276440 loan_category 1 loan_title 19 zip_code 1 state 1 debt_payment_ratio 1 delinquency_frequency 1 credit_earliest_date 1 creditor_inquiries 1 borrower_delinquency_recency 218802 public_record_recency 348845 open_credit_line_count 1 derogatory_public_record_count 1 total_credit 1 credit_utilization_rate 267 total_credit_line_count 1 initial_status 1 dtype: int64
na_df = pd.DataFrame(na_value)
na_df['NA_count'] = na_df
na_df['NA%'] = (na_df['NA_count'] / loan_data.shape[0])*100
na_df = na_df.iloc[:,1:3]
na_df
| NA_count | NA% | |
|---|---|---|
| interest_rate | 61010 | 15.25250 |
| loan_id | 1 | 0.00025 |
| borrower_id | 1 | 0.00025 |
| loan_requested | 1 | 0.00025 |
| loan_funded | 1 | 0.00025 |
| investor_funded | 1 | 0.00025 |
| payment_numbers | 1 | 0.00025 |
| loan_grade | 61270 | 15.31750 |
| loan_subgrade | 61270 | 15.31750 |
| job_title | 23986 | 5.99650 |
| years_employed | 17538 | 4.38450 |
| home_ownership | 61361 | 15.34025 |
| annual_income | 61028 | 15.25700 |
| income_verification | 1 | 0.00025 |
| issued_date | 1 | 0.00025 |
| loan_reason | 276440 | 69.11000 |
| loan_category | 1 | 0.00025 |
| loan_title | 19 | 0.00475 |
| zip_code | 1 | 0.00025 |
| state | 1 | 0.00025 |
| debt_payment_ratio | 1 | 0.00025 |
| delinquency_frequency | 1 | 0.00025 |
| credit_earliest_date | 1 | 0.00025 |
| creditor_inquiries | 1 | 0.00025 |
| borrower_delinquency_recency | 218802 | 54.70050 |
| public_record_recency | 348845 | 87.21125 |
| open_credit_line_count | 1 | 0.00025 |
| derogatory_public_record_count | 1 | 0.00025 |
| total_credit | 1 | 0.00025 |
| credit_utilization_rate | 267 | 0.06675 |
| total_credit_line_count | 1 | 0.00025 |
| initial_status | 1 | 0.00025 |
loan_data = loan_data.dropna(subset=['interest_rate'])
na_value = loan_data.isna().sum()
na_df = pd.DataFrame(na_value)
na_df['NA_count'] = na_df
na_df['NA%'] = (na_df['NA_count'] / loan_data.shape[0])*100
na_df = na_df.iloc[:,1:3]
na_df
| NA_count | NA% | |
|---|---|---|
| interest_rate | 0 | 0.000000 |
| loan_id | 1 | 0.000295 |
| borrower_id | 1 | 0.000295 |
| loan_requested | 1 | 0.000295 |
| loan_funded | 1 | 0.000295 |
| investor_funded | 1 | 0.000295 |
| payment_numbers | 1 | 0.000295 |
| loan_grade | 51867 | 15.300451 |
| loan_subgrade | 51867 | 15.300451 |
| job_title | 20257 | 5.975692 |
| years_employed | 14795 | 4.364436 |
| home_ownership | 51960 | 15.327886 |
| annual_income | 51752 | 15.266527 |
| income_verification | 1 | 0.000295 |
| issued_date | 1 | 0.000295 |
| loan_reason | 233934 | 69.009115 |
| loan_category | 1 | 0.000295 |
| loan_title | 17 | 0.005015 |
| zip_code | 1 | 0.000295 |
| state | 1 | 0.000295 |
| debt_payment_ratio | 1 | 0.000295 |
| delinquency_frequency | 1 | 0.000295 |
| credit_earliest_date | 1 | 0.000295 |
| creditor_inquiries | 1 | 0.000295 |
| borrower_delinquency_recency | 185457 | 54.708693 |
| public_record_recency | 295590 | 87.197262 |
| open_credit_line_count | 1 | 0.000295 |
| derogatory_public_record_count | 1 | 0.000295 |
| total_credit | 1 | 0.000295 |
| credit_utilization_rate | 225 | 0.066374 |
| total_credit_line_count | 1 | 0.000295 |
| initial_status | 1 | 0.000295 |
loan_data = loan_data.drop(columns = ['loan_reason', 'borrower_delinquency_recency', 'public_record_recency'])
loan_data.shape
(338990, 29)
loan_data.loan_grade.value_counts()
B 86121 C 76446 D 46984 A 45525 E 21628 F 8395 G 2024 Name: loan_grade, dtype: int64
len(loan_data.loan_grade.value_counts())
7
loan_data.loan_subgrade.value_counts()
B3 20352 B4 19137 B2 16767 C1 16342 C2 16310 B5 15521 C3 15425 C4 14646 B1 14344 C5 13723 A5 13086 A4 11806 D1 11720 D2 10498 D3 9091 D4 8573 A3 7653 D5 7102 A2 6496 A1 6484 E1 5447 E2 5246 E3 4230 E4 3640 E5 3065 F1 2490 F2 1873 F3 1712 F4 1331 F5 989 G1 677 G2 511 G3 378 G4 252 G5 206 Name: loan_subgrade, dtype: int64
len(loan_data.loan_subgrade.value_counts())
35
loan_data = loan_data.drop(columns = ['loan_id', 'borrower_id', 'loan_subgrade', 'zip_code'])
loan_data.shape
(338990, 25)
# issued_date
loan_data['issued_month'] = loan_data['issued_date'].apply(lambda x: str(x)[:3])
loan_data['issued_year'] = loan_data['issued_date'].apply(lambda x: str(x)[-2:])
# credit_earliest date
loan_data['credit_earliest_month'] = loan_data['credit_earliest_date'].apply(lambda x: str(x)[:3])
loan_data['credit_earliest_year'] = loan_data['credit_earliest_date'].apply(lambda x: str(x)[-2:])
loan_data = loan_data.drop(['issued_date', 'credit_earliest_date' ], axis=1)
loan_data['credit_earliest_month'].unique()
array(['Feb', 'Oct', 'Jun', 'Jan', 'Dec', 'Apr', 'Nov', 'Jul', 'May',
'Aug', 'Sep', 'Mar', 'nan'], dtype=object)
loan_data.loc[loan_data['credit_earliest_month'] == "nan"]
| interest_rate | loan_requested | loan_funded | investor_funded | payment_numbers | loan_grade | job_title | years_employed | home_ownership | annual_income | income_verification | loan_category | loan_title | state | debt_payment_ratio | delinquency_frequency | creditor_inquiries | open_credit_line_count | derogatory_public_record_count | total_credit | credit_utilization_rate | total_credit_line_count | initial_status | issued_month | issued_year | credit_earliest_month | credit_earliest_year | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 364111 | 7.69% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | nan | an | nan | an |
loan_data = loan_data.loc[loan_data['credit_earliest_month'] != 'nan']
loan_data['credit_earliest_year'].unique()
array(['94', '00', '85', '96', '98', '93', '01', '06', '95', '97', '90',
'91', '03', '05', '75', '02', '88', '86', '87', '92', '89', '99',
'04', '83', '80', '84', '71', '63', '81', '76', '82', '79', '72',
'78', '73', '77', '74', '69', '70', '68', '67', '64', '07', '61',
'62', '66', '65', '54', '08', '50', '46', '59', '56', '09', '58',
'60', '10', '55', '53', '57', '11', '44', '51', '49'], dtype=object)
loan_data.loc[loan_data['credit_earliest_year'] == "an"]
| interest_rate | loan_requested | loan_funded | investor_funded | payment_numbers | loan_grade | job_title | years_employed | home_ownership | annual_income | income_verification | loan_category | loan_title | state | debt_payment_ratio | delinquency_frequency | creditor_inquiries | open_credit_line_count | derogatory_public_record_count | total_credit | credit_utilization_rate | total_credit_line_count | initial_status | issued_month | issued_year | credit_earliest_month | credit_earliest_year |
|---|
loan_data = loan_data.loc[loan_data['credit_earliest_year'] != "an"]
loan_data['issued_month'].unique()
array(['Aug', 'May', 'Mar', 'Nov', 'Apr', 'Sep', 'Feb', 'Dec', 'Jun',
'Oct', 'Jul', 'Jan'], dtype=object)
loan_data.loc[loan_data['issued_month'] == "nan"]
| interest_rate | loan_requested | loan_funded | investor_funded | payment_numbers | loan_grade | job_title | years_employed | home_ownership | annual_income | income_verification | loan_category | loan_title | state | debt_payment_ratio | delinquency_frequency | creditor_inquiries | open_credit_line_count | derogatory_public_record_count | total_credit | credit_utilization_rate | total_credit_line_count | initial_status | issued_month | issued_year | credit_earliest_month | credit_earliest_year |
|---|
loan_data = loan_data.loc[loan_data['issued_month'] != "nan"]
loan_data['issued_year'].unique()
array(['09', '08', '14', '10', '12', '07', '11', '13'], dtype=object)
loan_data.loc[loan_data['issued_year'] == "an"]
| interest_rate | loan_requested | loan_funded | investor_funded | payment_numbers | loan_grade | job_title | years_employed | home_ownership | annual_income | income_verification | loan_category | loan_title | state | debt_payment_ratio | delinquency_frequency | creditor_inquiries | open_credit_line_count | derogatory_public_record_count | total_credit | credit_utilization_rate | total_credit_line_count | initial_status | issued_month | issued_year | credit_earliest_month | credit_earliest_year |
|---|
loan_data = loan_data.loc[loan_data['issued_year'] != "an"]
loan_data['credit_earliest_year'].unique()
array(['94', '00', '85', '96', '98', '93', '01', '06', '95', '97', '90',
'91', '03', '05', '75', '02', '88', '86', '87', '92', '89', '99',
'04', '83', '80', '84', '71', '63', '81', '76', '82', '79', '72',
'78', '73', '77', '74', '69', '70', '68', '67', '64', '07', '61',
'62', '66', '65', '54', '08', '50', '46', '59', '56', '09', '58',
'60', '10', '55', '53', '57', '11', '44', '51', '49'], dtype=object)
loan_data['credit_earliest_month'].unique()
array(['Feb', 'Oct', 'Jun', 'Jan', 'Dec', 'Apr', 'Nov', 'Jul', 'May',
'Aug', 'Sep', 'Mar'], dtype=object)
loan_data['issued_year'].unique()
array(['09', '08', '14', '10', '12', '07', '11', '13'], dtype=object)
loan_data['issued_month'].unique()
array(['Aug', 'May', 'Mar', 'Nov', 'Apr', 'Sep', 'Feb', 'Dec', 'Jun',
'Oct', 'Jul', 'Jan'], dtype=object)
# Define month convertion function
def mtn(x):
months = {
"jan": 1,
"feb": 2,
"mar": 3,
"apr": 4,
"may": 5,
"jun": 6,
"jul": 7,
"aug": 8,
"sep": 9,
"oct": 10,
"nov": 11,
"dec": 12,
}
try: # convert all the month into lowercase to match them to numbers
x = months[x.strip()[:3].lower()]
return x
except:
pass
# Define year convertion function
# Since it's only 2023 now, I assume that last 2 digits greater than 30 is in 1900 century
# for last 2 digits less than 30, I assume that it's in 2000 century
def year(x):
x = x.strip()[-2:]
try:
if float(x) > 30:
x = "19" + str(x)
else: #
x = "20" + str(x)
return int(x)
except:
pass
loan_data["issued_month"].unique()
array(['Aug', 'May', 'Mar', 'Nov', 'Apr', 'Sep', 'Feb', 'Dec', 'Jun',
'Oct', 'Jul', 'Jan'], dtype=object)
# change issued_date & credit_earliest_date's Month & Year
loan_data['issued_month'] = loan_data['issued_month'].apply(lambda x: mtn(x))
loan_data['credit_earliest_month'] = loan_data['credit_earliest_month'] .apply(lambda x: mtn(x))
loan_data['issued_year'] = loan_data['issued_year'].apply(lambda x: year(x))
loan_data['credit_earliest_year'] = loan_data['credit_earliest_year'].apply(lambda x: year(x))
loan_data.head()
| interest_rate | loan_requested | loan_funded | investor_funded | payment_numbers | loan_grade | job_title | years_employed | home_ownership | annual_income | income_verification | loan_category | loan_title | state | debt_payment_ratio | delinquency_frequency | creditor_inquiries | open_credit_line_count | derogatory_public_record_count | total_credit | credit_utilization_rate | total_credit_line_count | initial_status | issued_month | issued_year | credit_earliest_month | credit_earliest_year | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11.89% | $25,000 | $25,000 | $19,080 | 36 months | B | NaN | < 1 year | RENT | 85000.0 | VERIFIED - income | debt_consolidation | Debt consolidation for on-time payer | CA | 19.48 | 0.0 | 0.0 | 10.0 | 0.0 | 28854.0 | 52.10% | 42.0 | f | 8 | 2009 | 2 | 1994 |
| 1 | 10.71% | $7,000 | $7,000 | $673 | 36 months | B | CNN | < 1 year | RENT | 65000.0 | not verified | credit_card | Credit Card payoff | NY | 14.29 | 0.0 | 0.0 | 7.0 | 0.0 | 33623.0 | 76.70% | 7.0 | f | 5 | 2008 | 10 | 2000 |
| 2 | 16.99% | $25,000 | $25,000 | $24,725 | 36 months | D | Web Programmer | 1 year | RENT | 70000.0 | VERIFIED - income | debt_consolidation | mlue | NY | 10.50 | 0.0 | 0.0 | 10.0 | 0.0 | 19878.0 | 66.30% | 17.0 | f | 8 | 2014 | 6 | 2000 |
| 3 | 13.11% | $1,200 | $1,200 | $1,200 | 36 months | C | city of beaumont texas | 10+ years | OWN | 54000.0 | not verified | debt_consolidation | zxcvb | TX | 5.47 | 0.0 | 0.0 | 5.0 | 0.0 | 2584.0 | 40.40% | 31.0 | f | 3 | 2010 | 1 | 1985 |
| 4 | 13.57% | $10,800 | $10,800 | $10,692 | 36 months | C | State Farm Insurance | 6 years | RENT | 32000.0 | not verified | debt_consolidation | Nicolechr1978 | CT | 11.63 | 0.0 | 1.0 | 14.0 | 0.0 | 3511.0 | 25.60% | 40.0 | f | 11 | 2009 | 12 | 1996 |
countplt, ax = plt.subplots(figsize = (10,7))
ax = sns.countplot(x='issued_month',
data=loan_data,
order = loan_data['issued_month'].value_counts().index)
countplt, ax = plt.subplots(figsize = (10,7))
ax = sns.countplot(x='issued_year',
data=loan_data,
order = loan_data['issued_year'].value_counts().index)
countplt, ax = plt.subplots(figsize = (10,7))
ax = sns.countplot(x='credit_earliest_month',
data=loan_data,
order = loan_data['credit_earliest_month'].value_counts().index)
countplt, ax = plt.subplots(figsize = (10,7))
ax = sns.countplot(x='credit_earliest_year',
data=loan_data,
order = loan_data['credit_earliest_year'].value_counts().index[:10])
loan_data.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 338989 entries, 0 to 399999 Data columns (total 27 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 interest_rate 338989 non-null object 1 loan_requested 338989 non-null object 2 loan_funded 338989 non-null object 3 investor_funded 338989 non-null object 4 payment_numbers 338989 non-null object 5 loan_grade 287123 non-null object 6 job_title 318733 non-null object 7 years_employed 324195 non-null object 8 home_ownership 287030 non-null object 9 annual_income 287238 non-null float64 10 income_verification 338989 non-null object 11 loan_category 338989 non-null object 12 loan_title 338973 non-null object 13 state 338989 non-null object 14 debt_payment_ratio 338989 non-null float64 15 delinquency_frequency 338989 non-null float64 16 creditor_inquiries 338989 non-null float64 17 open_credit_line_count 338989 non-null float64 18 derogatory_public_record_count 338989 non-null float64 19 total_credit 338989 non-null float64 20 credit_utilization_rate 338765 non-null object 21 total_credit_line_count 338989 non-null float64 22 initial_status 338989 non-null object 23 issued_month 338989 non-null int64 24 issued_year 338989 non-null int64 25 credit_earliest_month 338989 non-null int64 26 credit_earliest_year 338989 non-null int64 dtypes: float64(8), int64(4), object(15) memory usage: 72.4+ MB
for col in ['loan_requested', 'loan_funded', 'investor_funded']:
loan_data[col] = loan_data[col].str.slice(start=1).str.replace(",","").astype("float")
for col in ['interest_rate', 'credit_utilization_rate']:
loan_data[col] = loan_data[col].str.slice(stop=-1).astype("float")
loan_data.hist(column = 'interest_rate')
array([[<AxesSubplot:title={'center':'interest_rate'}>]], dtype=object)
loan_data['interest_rate'].describe()
count 338989.000000 mean 13.946289 std 4.377945 min 5.420000 25% 10.990000 50% 13.680000 75% 16.780000 max 26.060000 Name: interest_rate, dtype: float64
loan_data.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 338989 entries, 0 to 399999 Data columns (total 27 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 interest_rate 338989 non-null float64 1 loan_requested 338989 non-null float64 2 loan_funded 338989 non-null float64 3 investor_funded 338989 non-null float64 4 payment_numbers 338989 non-null object 5 loan_grade 287123 non-null object 6 job_title 318733 non-null object 7 years_employed 324195 non-null object 8 home_ownership 287030 non-null object 9 annual_income 287238 non-null float64 10 income_verification 338989 non-null object 11 loan_category 338989 non-null object 12 loan_title 338973 non-null object 13 state 338989 non-null object 14 debt_payment_ratio 338989 non-null float64 15 delinquency_frequency 338989 non-null float64 16 creditor_inquiries 338989 non-null float64 17 open_credit_line_count 338989 non-null float64 18 derogatory_public_record_count 338989 non-null float64 19 total_credit 338989 non-null float64 20 credit_utilization_rate 338765 non-null float64 21 total_credit_line_count 338989 non-null float64 22 initial_status 338989 non-null object 23 issued_month 338989 non-null int64 24 issued_year 338989 non-null int64 25 credit_earliest_month 338989 non-null int64 26 credit_earliest_year 338989 non-null int64 dtypes: float64(13), int64(4), object(10) memory usage: 72.4+ MB
loan_data.loan_requested.describe()
count 338989.000000 mean 14276.968353 std 8252.264104 min 500.000000 25% 8000.000000 50% 12000.000000 75% 20000.000000 max 35000.000000 Name: loan_requested, dtype: float64
sns.distplot(loan_data[loan_data['loan_requested'].notnull()]['loan_requested'])
/var/folders/g6/gybbdfkx34769rq_nbzlxb0m0000gn/T/ipykernel_20382/454341747.py:1: UserWarning: `distplot` is a deprecated function and will be removed in seaborn v0.14.0. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms). For a guide to updating your code to use the new functions, please see https://gist.github.com/mwaskom/de44147ed2974457ad6372750bbe5751 sns.distplot(loan_data[loan_data['loan_requested'].notnull()]['loan_requested'])
<AxesSubplot:xlabel='loan_requested', ylabel='Density'>
loan_data['loan_requested'].fillna(loan_data['loan_requested'].median(), inplace=True)
sns.distplot(loan_data[loan_data['loan_requested'].notnull()]['loan_requested'])
/var/folders/g6/gybbdfkx34769rq_nbzlxb0m0000gn/T/ipykernel_20382/454341747.py:1: UserWarning: `distplot` is a deprecated function and will be removed in seaborn v0.14.0. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms). For a guide to updating your code to use the new functions, please see https://gist.github.com/mwaskom/de44147ed2974457ad6372750bbe5751 sns.distplot(loan_data[loan_data['loan_requested'].notnull()]['loan_requested'])
<AxesSubplot:xlabel='loan_requested', ylabel='Density'>
loan_data.loan_requested.describe()
count 338989.000000 mean 14276.968353 std 8252.264104 min 500.000000 25% 8000.000000 50% 12000.000000 75% 20000.000000 max 35000.000000 Name: loan_requested, dtype: float64
loan_data.loan_funded.describe()
count 338989.000000 mean 14248.306140 std 8238.064335 min 500.000000 25% 8000.000000 50% 12000.000000 75% 20000.000000 max 35000.000000 Name: loan_funded, dtype: float64
sns.distplot(loan_data[loan_data['loan_funded'].notnull()]['loan_funded'])
/var/folders/g6/gybbdfkx34769rq_nbzlxb0m0000gn/T/ipykernel_20382/3528064487.py:1: UserWarning: `distplot` is a deprecated function and will be removed in seaborn v0.14.0. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms). For a guide to updating your code to use the new functions, please see https://gist.github.com/mwaskom/de44147ed2974457ad6372750bbe5751 sns.distplot(loan_data[loan_data['loan_funded'].notnull()]['loan_funded'])
<AxesSubplot:xlabel='loan_funded', ylabel='Density'>
loan_data['loan_funded'].fillna(loan_data['loan_funded'].median(), inplace=True)
loan_data.loan_funded.describe()
count 338989.000000 mean 14248.306140 std 8238.064335 min 500.000000 25% 8000.000000 50% 12000.000000 75% 20000.000000 max 35000.000000 Name: loan_funded, dtype: float64
loan_data.investor_funded.describe()
count 338989.000000 mean 14182.000445 std 8255.138412 min 0.000000 25% 8000.000000 50% 12000.000000 75% 19900.000000 max 35000.000000 Name: investor_funded, dtype: float64
sns.distplot(loan_data[loan_data['investor_funded'].notnull()]['investor_funded'])
/var/folders/g6/gybbdfkx34769rq_nbzlxb0m0000gn/T/ipykernel_20382/2461909109.py:1: UserWarning: `distplot` is a deprecated function and will be removed in seaborn v0.14.0. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms). For a guide to updating your code to use the new functions, please see https://gist.github.com/mwaskom/de44147ed2974457ad6372750bbe5751 sns.distplot(loan_data[loan_data['investor_funded'].notnull()]['investor_funded'])
<AxesSubplot:xlabel='investor_funded', ylabel='Density'>
loan_data['investor_funded'].fillna(loan_data['investor_funded'].median(), inplace=True)
loan_data.investor_funded.describe()
count 338989.000000 mean 14182.000445 std 8255.138412 min 0.000000 25% 8000.000000 50% 12000.000000 75% 19900.000000 max 35000.000000 Name: investor_funded, dtype: float64
loan_data.payment_numbers.value_counts()
36 months 247791 60 months 91198 Name: payment_numbers, dtype: int64
247791/(247791+ 91198)
0.7309706214655933
loan_data['is_36months'] = loan_data['payment_numbers'].apply(lambda x: 1 if x == '36 months' else 0)
loan_data = loan_data.drop('payment_numbers', axis=1)
loan_data.loan_grade.value_counts()
B 86121 C 76446 D 46984 A 45525 E 21628 F 8395 G 2024 Name: loan_grade, dtype: int64
sns.countplot(x='loan_grade', data=loan_data)
<AxesSubplot:xlabel='loan_grade', ylabel='count'>
loan_data.loan_grade = loan_data.loan_grade.fillna('Other')
loan_data.loan_grade.value_counts()
B 86121 C 76446 Other 51866 D 46984 A 45525 E 21628 F 8395 G 2024 Name: loan_grade, dtype: int64
len(loan_data.job_title.value_counts())
163395
loan_data.job_title = loan_data.job_title.fillna('Other')
loan_data = loan_data.drop('job_title', axis=1)
loan_data.years_employed.value_counts()
10+ years 108491 2 years 30117 3 years 26670 < 1 year 26003 5 years 23072 1 year 21432 4 years 20259 6 years 19601 7 years 19445 8 years 16212 9 years 12893 Name: years_employed, dtype: int64
loan_data['years_employed'] = loan_data['years_employed'].str.replace("< 1", "0").replace("+", "")
loan_data['years_employed'] = loan_data['years_employed'].str.strip()
loan_data['years_employed'] = loan_data['years_employed'].str.slice(stop=2)
loan_data.years_employed.value_counts()
10 108491 2 30117 3 26670 0 26003 5 23072 1 21432 4 20259 6 19601 7 19445 8 16212 9 12893 Name: years_employed, dtype: int64
# loan_data['years_employed'] = loan_data['years_employed'].apply(lambda x: x.replace("< 1", "0").replace("+", "").strip()[:2])
loan_data.years_employed.unique()
array(['0 ', '1 ', '10', '6 ', '9 ', '3 ', '5 ', '2 ', '4 ', '8 ', '7 ',
nan], dtype=object)
sns.histplot(loan_data[loan_data["years_employed"].notnull()]["years_employed"])
plt.show()
loan_data["years_employed"].describe()
count 324195 unique 11 top 10 freq 108491 Name: years_employed, dtype: object
loan_data['years_employed'] = loan_data['years_employed'].astype('float')
sns.histplot(loan_data[loan_data["years_employed"].notnull()]["years_employed"])
plt.show()
loan_data["years_employed"].describe()
count 324195.000000 mean 5.991567 std 3.613908 min 0.000000 25% 3.000000 50% 6.000000 75% 10.000000 max 10.000000 Name: years_employed, dtype: float64
loan_data['years_employed'] = loan_data['years_employed'].interpolate()
sns.histplot(loan_data[loan_data["years_employed"].notnull()]["years_employed"])
plt.show()
loan_data["years_employed"].describe()
count 338989.000000 mean 5.996765 std 3.574865 min 0.000000 25% 3.000000 50% 6.000000 75% 10.000000 max 10.000000 Name: years_employed, dtype: float64
year_condition = [(loan_data['years_employed'] >= 0) & (loan_data['years_employed'] < 6),
(loan_data['years_employed'] >= 6) & (loan_data['years_employed'] < 10),
loan_data['years_employed'] >= 10]
year_em_categories = ['G1', 'G2', 'G3']
loan_data['YE_categories'] = np.select(year_condition, year_em_categories)
sns.countplot(x='YE_categories', data=loan_data)
<AxesSubplot:xlabel='YE_categories', ylabel='count'>
loan_data.YE_categories.value_counts()
G1 154032 G3 110301 G2 74656 Name: YE_categories, dtype: int64
loan_data["YE_categories"] = loan_data["YE_categories"].astype("category")
loan_data = loan_data.drop('years_employed', axis=1)
loan_data.home_ownership.value_counts()
MORTGAGE 145958 RENT 115958 OWN 24976 OTHER 107 NONE 30 ANY 1 Name: home_ownership, dtype: int64
sns.countplot(x='home_ownership', data=loan_data)
<AxesSubplot:xlabel='home_ownership', ylabel='count'>
loan_data["home_ownership"] = loan_data["home_ownership"].str.replace("NONE", "OTHER").str.replace("ANY", "OTHER")
loan_data["home_ownership"] = loan_data["home_ownership"] .fillna("OTHER")
loan_data["home_ownership"].value_counts()
MORTGAGE 145958 RENT 115958 OTHER 52097 OWN 24976 Name: home_ownership, dtype: int64
sns.countplot(x='home_ownership', data=loan_data)
<AxesSubplot:xlabel='home_ownership', ylabel='count'>
loan_data.annual_income.describe()
count 2.872380e+05 mean 7.315146e+04 std 5.618967e+04 min 3.000000e+03 25% 4.500000e+04 50% 6.300000e+04 75% 8.807875e+04 max 7.500000e+06 Name: annual_income, dtype: float64
sns.distplot(loan_data[loan_data['annual_income'].notnull()]['annual_income'])
/var/folders/g6/gybbdfkx34769rq_nbzlxb0m0000gn/T/ipykernel_20382/3601847436.py:1: UserWarning: `distplot` is a deprecated function and will be removed in seaborn v0.14.0. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms). For a guide to updating your code to use the new functions, please see https://gist.github.com/mwaskom/de44147ed2974457ad6372750bbe5751 sns.distplot(loan_data[loan_data['annual_income'].notnull()]['annual_income'])
<AxesSubplot:xlabel='annual_income', ylabel='Density'>
len(loan_data.annual_income.unique())
22155
max(loan_data.annual_income)
7500000.0
min(loan_data.annual_income)
3000.0
loan_data.annual_income.mean()
73151.46409611532
loan_data.annual_income.median()
63000.0
loan_data.annual_income = loan_data.annual_income.interpolate()
loan_data.annual_income.describe()
count 3.389890e+05 mean 7.311117e+04 std 5.369898e+04 min 3.000000e+03 25% 4.600000e+04 50% 6.400000e+04 75% 8.750000e+04 max 7.500000e+06 Name: annual_income, dtype: float64
sns.distplot(loan_data[loan_data['annual_income'].notnull()]['annual_income'])
/var/folders/g6/gybbdfkx34769rq_nbzlxb0m0000gn/T/ipykernel_20382/3601847436.py:1: UserWarning: `distplot` is a deprecated function and will be removed in seaborn v0.14.0. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms). For a guide to updating your code to use the new functions, please see https://gist.github.com/mwaskom/de44147ed2974457ad6372750bbe5751 sns.distplot(loan_data[loan_data['annual_income'].notnull()]['annual_income'])
<AxesSubplot:xlabel='annual_income', ylabel='Density'>
income_condition = [(loan_data['annual_income'] >= 0) & (loan_data['annual_income'] < 100000),
(loan_data['annual_income'] >= 100000) & (loan_data['annual_income'] < 200000),
(loan_data['annual_income'] >= 200000) & (loan_data['annual_income'] < 300000),
(loan_data['annual_income'] >= 300000) & (loan_data['annual_income'] < 400000),
(loan_data['annual_income'] >= 400000) & (loan_data['annual_income'] < 500000),
(loan_data['annual_income'] >= 500000) & (loan_data['annual_income'] < 600000),
(loan_data['annual_income'] >= 600000) & (loan_data['annual_income'] < 700000),
loan_data['annual_income'] >= 700000]
income_categories = ['G1', 'G2', 'G3', 'G4', 'G5', 'G6', 'G7', 'G8']
loan_data['income_categories'] = np.select(income_condition, income_categories)
loan_data['income_categories'].value_counts()
G1 278962 G2 54035 G3 4409 G4 933 G5 306 G6 145 G8 141 G7 58 Name: income_categories, dtype: int64
sns.countplot(x='income_categories', data=loan_data)
<AxesSubplot:xlabel='income_categories', ylabel='count'>
loan_data.groupby('income_categories')[['income_categories', 'interest_rate']].mean()
| interest_rate | |
|---|---|
| income_categories | |
| G1 | 14.020590 |
| G2 | 13.591988 |
| G3 | 13.611095 |
| G4 | 13.770472 |
| G5 | 14.598562 |
| G6 | 13.741517 |
| G7 | 13.613621 |
| G8 | 13.299362 |
loan_data['income_over_100000'] = loan_data['income_categories'].apply(lambda x:1 if x =='G1' else 0)
loan_data['income_over_100000'].value_counts()
1 278962 0 60027 Name: income_over_100000, dtype: int64
# drop income categories
loan_data = loan_data.drop(columns = ['income_categories'])
loan_data['income_verification'].value_counts()
VERIFIED - income 127040 not verified 107873 VERIFIED - income source 104076 Name: income_verification, dtype: int64
loan_data["income_verification"] = loan_data["income_verification"].str.replace("VERIFIED - income", "verified").str.replace("verified source","verified")
loan_data["income_verification"] = loan_data["income_verification"] .fillna("verified")
loan_data['income_verification'].value_counts()
verified 231116 not verified 107873 Name: income_verification, dtype: int64
sns.countplot(x='income_verification', data=loan_data)
<AxesSubplot:xlabel='income_verification', ylabel='count'>
loan_data['loan_category'].value_counts()
debt_consolidation 198226 credit_card 75680 home_improvement 19625 other 17154 major_purchase 7312 small_business 5359 car 4115 medical 3329 moving 2138 wedding 1934 vacation 1848 house 1723 educational 279 renewable_energy 267 Name: loan_category, dtype: int64
countplt, ax = plt.subplots(figsize = (10,7))
ax = sns.countplot(x='loan_category',
data=loan_data,
order = loan_data['loan_category'].value_counts().index[:5])
loan_data.groupby('loan_category')[['loan_category', 'interest_rate']].mean()
| interest_rate | |
|---|---|
| loan_category | |
| car | 12.044260 |
| credit_card | 12.740792 |
| debt_consolidation | 14.228413 |
| educational | 11.476344 |
| home_improvement | 13.623039 |
| house | 15.048247 |
| major_purchase | 12.826208 |
| medical | 15.444302 |
| moving | 16.084233 |
| other | 15.889428 |
| renewable_energy | 15.028652 |
| small_business | 15.892461 |
| vacation | 15.280622 |
| wedding | 14.144917 |
loan_data['loan_title'] = loan_data['loan_title'].astype('str').apply(lambda x: x.lower())
# loan_data['loan_title'].value_counts()
len(loan_data['loan_title'].value_counts())
45276
# Debt consolidation
123592/338973
0.3646072106037944
# Credit card refinancing
40820/338973
0.12042257052921619
There are 45276 different loan title for this variable, I will drop this variable. Will consider using clustering techniques such as K-means, DCSCAN, hirarchical clustering to do it if the model performance is not good enough
loan_data = loan_data.drop(columns = ['loan_title'])
loan_data['state'].value_counts()
CA 52835 NY 29226 TX 26493 FL 22756 IL 13483 NJ 13188 PA 11877 OH 11040 GA 10851 VA 10338 NC 9303 MI 8350 MA 8038 MD 8016 AZ 7747 WA 7706 CO 7119 MN 5865 MO 5398 CT 5243 NV 4753 IN 4608 OR 4411 WI 4245 TN 4215 AL 4188 LA 4017 SC 3979 KY 3173 KS 3096 OK 3014 AR 2532 UT 2531 NM 1848 HI 1798 WV 1738 NH 1648 RI 1486 DC 1080 MT 995 AK 947 DE 896 WY 853 SD 730 MS 707 VT 604 ID 8 IA 7 NE 6 ME 4 Name: state, dtype: int64
len(loan_data['state'].value_counts())
50
states = {
"AK": "Other",
"AL": "South",
"AR": "South",
"AS": "Other",
"AZ": "West",
"CA": "West",
"CO": "West",
"CT": "Northeast",
"DC": "Northeast",
"DE": "Northeast",
"FL": "South",
"GA": "South",
"GU": "Other",
"HI": "Other",
"IA": "Midwest",
"ID": "West",
"IL": "Midwest",
"IN": "Midwest",
"KS": "Midwest",
"KY": "South",
"LA": "South",
"MA": "Northeast",
"MD": "Northeast",
"ME": "Northeast",
"MI": "West",
"MN": "Midwest",
"MO": "Midwest",
"MP": "Other",
"MS": "South",
"MT": "West",
"NA": "Other",
"NC": "South",
"ND": "Midwest",
"NE": "West",
"NH": "Northeast",
"NJ": "Northeast",
"NM": "West",
"NV": "West",
"NY": "Northeast",
"OH": "Midwest",
"OK": "South",
"OR": "West",
"PA": "Northeast",
"PR": "Other",
"RI": "Northeast",
"SC": "South",
"SD": "Midwest",
"TN": "South",
"TX": "South",
"UT": "West",
"VA": "South",
"VI": "Other",
"VT": "Northeast",
"WA": "West",
"WI": "Midwest",
"WV": "South",
"WY": "West",
}
loan_data["region"] = loan_data["state"].map(states)
loan_data = loan_data.drop("state", axis=1)
loan_data.head()
| interest_rate | loan_requested | loan_funded | investor_funded | loan_grade | home_ownership | annual_income | income_verification | loan_category | debt_payment_ratio | delinquency_frequency | creditor_inquiries | open_credit_line_count | derogatory_public_record_count | total_credit | credit_utilization_rate | total_credit_line_count | initial_status | issued_month | issued_year | credit_earliest_month | credit_earliest_year | is_36months | YE_categories | income_over_100000 | region | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11.89 | 25000.0 | 25000.0 | 19080.0 | B | RENT | 85000.0 | verified | debt_consolidation | 19.48 | 0.0 | 0.0 | 10.0 | 0.0 | 28854.0 | 52.1 | 42.0 | f | 8 | 2009 | 2 | 1994 | 0 | G1 | 1 | West |
| 1 | 10.71 | 7000.0 | 7000.0 | 673.0 | B | RENT | 65000.0 | not verified | credit_card | 14.29 | 0.0 | 0.0 | 7.0 | 0.0 | 33623.0 | 76.7 | 7.0 | f | 5 | 2008 | 10 | 2000 | 0 | G1 | 1 | Northeast |
| 2 | 16.99 | 25000.0 | 25000.0 | 24725.0 | D | RENT | 70000.0 | verified | debt_consolidation | 10.50 | 0.0 | 0.0 | 10.0 | 0.0 | 19878.0 | 66.3 | 17.0 | f | 8 | 2014 | 6 | 2000 | 0 | G1 | 1 | Northeast |
| 3 | 13.11 | 1200.0 | 1200.0 | 1200.0 | C | OWN | 54000.0 | not verified | debt_consolidation | 5.47 | 0.0 | 0.0 | 5.0 | 0.0 | 2584.0 | 40.4 | 31.0 | f | 3 | 2010 | 1 | 1985 | 0 | G3 | 1 | South |
| 4 | 13.57 | 10800.0 | 10800.0 | 10692.0 | C | RENT | 32000.0 | not verified | debt_consolidation | 11.63 | 0.0 | 1.0 | 14.0 | 0.0 | 3511.0 | 25.6 | 40.0 | f | 11 | 2009 | 12 | 1996 | 0 | G2 | 1 | Northeast |
loan_data["region"].value_counts()
South 107304 West 99162 Northeast 81306 Midwest 48472 Other 2745 Name: region, dtype: int64
loan_data.hist(column = 'debt_payment_ratio')
array([[<AxesSubplot:title={'center':'debt_payment_ratio'}>]],
dtype=object)
loan_data['debt_payment_ratio'].describe()
count 338989.000000 mean 16.997242 std 7.697223 min 0.000000 25% 11.250000 50% 16.700000 75% 22.500000 max 39.990000 Name: debt_payment_ratio, dtype: float64
len(loan_data[loan_data['debt_payment_ratio']>36])
845
loan_data.hist(column = 'delinquency_frequency')
array([[<AxesSubplot:title={'center':'delinquency_frequency'}>]],
dtype=object)
loan_data['delinquency_frequency'].describe()
count 338989.000000 mean 0.274286 std 0.779294 min 0.000000 25% 0.000000 50% 0.000000 75% 0.000000 max 29.000000 Name: delinquency_frequency, dtype: float64
len(loan_data[loan_data['delinquency_frequency'] == 0])
280249
280249/338989
0.8267200410632793
loan_data['has_delinquency'] = loan_data['delinquency_frequency'].apply(lambda x: 1 if x != 0 else 0)
loan_data['has_delinquency'].value_counts()
0 280249 1 58740 Name: has_delinquency, dtype: int64
sns.countplot(x='has_delinquency', data=loan_data)
<AxesSubplot:xlabel='has_delinquency', ylabel='count'>
loan_data = loan_data.drop(columns = ['delinquency_frequency'])
# loan_data.head(5)
loan_data.hist(column = 'creditor_inquiries')
array([[<AxesSubplot:title={'center':'creditor_inquiries'}>]],
dtype=object)
loan_data['creditor_inquiries'].describe()
count 338989.000000 mean 0.817911 std 1.059264 min 0.000000 25% 0.000000 50% 0.000000 75% 1.000000 max 8.000000 Name: creditor_inquiries, dtype: float64
len(loan_data[loan_data['creditor_inquiries'] == 0])
172066
172066/338989
0.5075857918693527
len(loan_data[(loan_data['creditor_inquiries'] >= 0) & (loan_data['creditor_inquiries'] < 3)])
311443
311443/338989
0.9187407261002569
loan_data['has_inquiry'] = loan_data['creditor_inquiries'].apply(lambda x: 1 if x != 0 else 0)
loan_data['has_inquiry'].value_counts()
0 172066 1 166923 Name: has_inquiry, dtype: int64
sns.countplot(x='has_inquiry', data=loan_data)
<AxesSubplot:xlabel='has_inquiry', ylabel='count'>
loan_data = loan_data.drop(columns = ['creditor_inquiries'])
loan_data.hist(column = 'open_credit_line_count')
array([[<AxesSubplot:title={'center':'open_credit_line_count'}>]],
dtype=object)
loan_data['open_credit_line_count'].describe()
count 338989.000000 mean 11.118081 std 4.896081 min 0.000000 25% 8.000000 50% 10.000000 75% 14.000000 max 76.000000 Name: open_credit_line_count, dtype: float64
loan_data['log_open_credit_line'] = np.log10(loan_data['open_credit_line_count'])
loan_data.head(5)
/Users/maggiechuang/opt/anaconda3/lib/python3.9/site-packages/pandas/core/arraylike.py:397: RuntimeWarning: divide by zero encountered in log10 result = getattr(ufunc, method)(*inputs, **kwargs)
| interest_rate | loan_requested | loan_funded | investor_funded | loan_grade | home_ownership | annual_income | income_verification | loan_category | debt_payment_ratio | open_credit_line_count | derogatory_public_record_count | total_credit | credit_utilization_rate | total_credit_line_count | initial_status | issued_month | issued_year | credit_earliest_month | credit_earliest_year | is_36months | YE_categories | income_over_100000 | region | has_delinquency | has_inquiry | log_open_credit_line | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11.89 | 25000.0 | 25000.0 | 19080.0 | B | RENT | 85000.0 | verified | debt_consolidation | 19.48 | 10.0 | 0.0 | 28854.0 | 52.1 | 42.0 | f | 8 | 2009 | 2 | 1994 | 0 | G1 | 1 | West | 0 | 0 | 1.000000 |
| 1 | 10.71 | 7000.0 | 7000.0 | 673.0 | B | RENT | 65000.0 | not verified | credit_card | 14.29 | 7.0 | 0.0 | 33623.0 | 76.7 | 7.0 | f | 5 | 2008 | 10 | 2000 | 0 | G1 | 1 | Northeast | 0 | 0 | 0.845098 |
| 2 | 16.99 | 25000.0 | 25000.0 | 24725.0 | D | RENT | 70000.0 | verified | debt_consolidation | 10.50 | 10.0 | 0.0 | 19878.0 | 66.3 | 17.0 | f | 8 | 2014 | 6 | 2000 | 0 | G1 | 1 | Northeast | 0 | 0 | 1.000000 |
| 3 | 13.11 | 1200.0 | 1200.0 | 1200.0 | C | OWN | 54000.0 | not verified | debt_consolidation | 5.47 | 5.0 | 0.0 | 2584.0 | 40.4 | 31.0 | f | 3 | 2010 | 1 | 1985 | 0 | G3 | 1 | South | 0 | 0 | 0.698970 |
| 4 | 13.57 | 10800.0 | 10800.0 | 10692.0 | C | RENT | 32000.0 | not verified | debt_consolidation | 11.63 | 14.0 | 0.0 | 3511.0 | 25.6 | 40.0 | f | 11 | 2009 | 12 | 1996 | 0 | G2 | 1 | Northeast | 0 | 1 | 1.146128 |
# loan_data['log_open_credit_line'].unique()
# len(loan_data[loan_data['log_open_credit_line'] <0])
loan_data.loc[loan_data['log_open_credit_line'] <0]
| interest_rate | loan_requested | loan_funded | investor_funded | loan_grade | home_ownership | annual_income | income_verification | loan_category | debt_payment_ratio | open_credit_line_count | derogatory_public_record_count | total_credit | credit_utilization_rate | total_credit_line_count | initial_status | issued_month | issued_year | credit_earliest_month | credit_earliest_year | is_36months | YE_categories | income_over_100000 | region | has_delinquency | has_inquiry | log_open_credit_line | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 101071 | 16.29 | 3000.0 | 3000.0 | 3000.0 | C | RENT | 50000.000000 | not verified | car | 0.00 | 0.0 | 0.0 | 0.0 | NaN | 5.0 | f | 1 | 2013 | 12 | 2003 | 0 | G1 | 1 | South | 0 | 0 | -inf |
| 123995 | 22.47 | 3900.0 | 3900.0 | 3900.0 | Other | RENT | 13000.000000 | verified | debt_consolidation | 6.44 | 0.0 | 0.0 | 0.0 | NaN | 10.0 | f | 4 | 2013 | 4 | 2004 | 0 | G1 | 1 | Midwest | 0 | 0 | -inf |
| 148869 | 18.25 | 16000.0 | 16000.0 | 16000.0 | Other | RENT | 55666.666667 | verified | car | 0.00 | 0.0 | 0.0 | 0.0 | NaN | 19.0 | w | 8 | 2013 | 1 | 1995 | 0 | G2 | 1 | Northeast | 0 | 0 | -inf |
| 235849 | 14.98 | 8000.0 | 8000.0 | 8000.0 | C | RENT | 120000.000000 | verified | major_purchase | 0.00 | 0.0 | 0.0 | 0.0 | NaN | 6.0 | f | 1 | 2014 | 7 | 2008 | 0 | G1 | 0 | West | 0 | 0 | -inf |
| 268243 | 21.18 | 3550.0 | 3550.0 | 3550.0 | Other | OWN | 43680.000000 | verified | debt_consolidation | 0.00 | 0.0 | 0.0 | 0.0 | NaN | 3.0 | w | 3 | 2014 | 12 | 1980 | 0 | G1 | 1 | South | 0 | 0 | -inf |
| 269378 | 18.92 | 9400.0 | 9400.0 | 9400.0 | Other | OWN | 93000.000000 | verified | credit_card | 0.52 | 0.0 | 0.0 | 0.0 | NaN | 6.0 | f | 3 | 2014 | 5 | 1997 | 0 | G1 | 1 | Northeast | 1 | 1 | -inf |
# loan_data['has_inquiry'] = loan_data['creditor_inquiries'].apply(lambda x: 1 if x != 0 else 0)
loan_data['log_open_credit_line'].replace([-np.inf], 0, inplace = True)
loan_data['log_open_credit_line'].unique()
array([1. , 0.84509804, 0.69897 , 1.14612804, 0.77815125,
0.47712125, 1.11394335, 1.04139269, 1.30103 , 0.30103 ,
1.20411998, 0.60205999, 1.07918125, 1.23044892, 0.90308999,
1.17609126, 1.32221929, 0.95424251, 1.2787536 , 1.36172784,
1.43136376, 1.38021124, 1.34242268, 1.47712125, 1.62324929,
1.49136169, 1.50514998, 1.25527251, 1.39794001, 1.61278386,
1.462398 , 1.44715803, 1.41497335, 1.54406804, 1.64345268,
1.53147892, 1.5563025 , 1.5797836 , 1.63346846, 1.51851394,
1.56820172, 1.69019608, 1.65321251, 1.59106461, 1.66275783,
0. , 1.69897 , 1.60205999, 1.68124124, 1.79239169,
1.76342799, 1.70757018, 1.78532984, 1.88081359, 1.67209786,
1.73239376, 1.74036269, 1.72427587])
loan_data.hist(column = 'log_open_credit_line')
array([[<AxesSubplot:title={'center':'log_open_credit_line'}>]],
dtype=object)
loan_data = loan_data.drop(['open_credit_line_count'], axis=1)
loan_data.hist(column = 'derogatory_public_record_count')
array([[<AxesSubplot:title={'center':'derogatory_public_record_count'}>]],
dtype=object)
loan_data['derogatory_public_record_count'].describe()
count 338989.000000 mean 0.152689 std 0.505363 min 0.000000 25% 0.000000 50% 0.000000 75% 0.000000 max 63.000000 Name: derogatory_public_record_count, dtype: float64
len(loan_data[loan_data['derogatory_public_record_count'] == 0])
296229
296229/338989
0.8738602137532486
loan_data['derogatory_public_record'] = loan_data['derogatory_public_record_count'].apply(lambda x: 1 if x != 0 else 0)
sns.countplot(x='derogatory_public_record', data=loan_data)
<AxesSubplot:xlabel='derogatory_public_record', ylabel='count'>
loan_data.hist(column = 'total_credit')
array([[<AxesSubplot:title={'center':'total_credit'}>]], dtype=object)
loan_data['total_credit'].describe()
count 3.389890e+05 mean 1.597434e+04 std 1.976517e+04 min 0.000000e+00 25% 6.462000e+03 50% 1.178200e+04 75% 2.023300e+04 max 2.568995e+06 Name: total_credit, dtype: float64
# 40.889k
fig = px.box(loan_data, y="total_credit")
fig.show()
len(loan_data[loan_data['total_credit'] > 40889])
15778
15778/338989
0.04654428314783075
loan_data = loan_data[loan_data['total_credit'] <= 40889]
loan_data.hist(column = 'total_credit')
array([[<AxesSubplot:title={'center':'total_credit'}>]], dtype=object)
loan_data.hist(column = 'credit_utilization_rate')
array([[<AxesSubplot:title={'center':'credit_utilization_rate'}>]],
dtype=object)
loan_data['credit_utilization_rate'].describe()
count 323009.000000 mean 55.811409 std 23.726387 min 0.000000 25% 38.900000 50% 57.200000 75% 74.300000 max 892.300000 Name: credit_utilization_rate, dtype: float64
loan_data['credit_utilization_rate'] = loan_data['credit_utilization_rate'].interpolate()
loan_data['credit_utilization_rate'].describe()
count 323211.000000 mean 55.810266 std 23.722860 min 0.000000 25% 38.900000 50% 57.200000 75% 74.300000 max 892.300000 Name: credit_utilization_rate, dtype: float64
loan_data.hist(column = 'credit_utilization_rate')
array([[<AxesSubplot:title={'center':'credit_utilization_rate'}>]],
dtype=object)
fig = px.box(loan_data, y="credit_utilization_rate")
fig.show()
len(loan_data[loan_data['credit_utilization_rate'] > 127.4])
12
loan_data = loan_data[loan_data['credit_utilization_rate'] <= 127.4]
loan_data.hist(column = 'credit_utilization_rate')
array([[<AxesSubplot:title={'center':'credit_utilization_rate'}>]],
dtype=object)
loan_data['credit_utilization_rate'].describe()
count 323199.000000 mean 55.804006 std 23.666161 min 0.000000 25% 38.900000 50% 57.200000 75% 74.300000 max 127.400000 Name: credit_utilization_rate, dtype: float64
loan_data.hist(column = 'total_credit_line_count')
array([[<AxesSubplot:title={'center':'total_credit_line_count'}>]],
dtype=object)
loan_data['total_credit_line_count'].describe()
count 323199.000000 mean 24.634126 std 11.356151 min 2.000000 25% 16.000000 50% 23.000000 75% 31.000000 max 118.000000 Name: total_credit_line_count, dtype: float64
fig = px.box(loan_data, y="total_credit_line_count")
fig.show()
len(loan_data[loan_data['total_credit_line_count'] > 53])
6022
6022/323199
0.018632483392584755
loan_data = loan_data[loan_data['total_credit_line_count'] <= 53]
loan_data['total_credit_line_count'].describe()
count 317177.000000 mean 23.968957 std 10.348813 min 2.000000 25% 16.000000 50% 23.000000 75% 31.000000 max 53.000000 Name: total_credit_line_count, dtype: float64
loan_data.hist(column = 'total_credit_line_count')
array([[<AxesSubplot:title={'center':'total_credit_line_count'}>]],
dtype=object)
loan_data['initial_status'].value_counts()
f 218274 w 98903 Name: initial_status, dtype: int64
sns.countplot(x='initial_status', data=loan_data)
<AxesSubplot:xlabel='initial_status', ylabel='count'>
loan_data.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 317177 entries, 0 to 399999 Data columns (total 27 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 interest_rate 317177 non-null float64 1 loan_requested 317177 non-null float64 2 loan_funded 317177 non-null float64 3 investor_funded 317177 non-null float64 4 loan_grade 317177 non-null object 5 home_ownership 317177 non-null object 6 annual_income 317177 non-null float64 7 income_verification 317177 non-null object 8 loan_category 317177 non-null object 9 debt_payment_ratio 317177 non-null float64 10 derogatory_public_record_count 317177 non-null float64 11 total_credit 317177 non-null float64 12 credit_utilization_rate 317177 non-null float64 13 total_credit_line_count 317177 non-null float64 14 initial_status 317177 non-null object 15 issued_month 317177 non-null int64 16 issued_year 317177 non-null int64 17 credit_earliest_month 317177 non-null int64 18 credit_earliest_year 317177 non-null int64 19 is_36months 317177 non-null int64 20 YE_categories 317177 non-null category 21 income_over_100000 317177 non-null int64 22 region 317177 non-null object 23 has_delinquency 317177 non-null int64 24 has_inquiry 317177 non-null int64 25 log_open_credit_line 317177 non-null float64 26 derogatory_public_record 317177 non-null int64 dtypes: category(1), float64(11), int64(9), object(6) memory usage: 65.6+ MB
#loan_data_df = loan_data.drop(columns = ['job_title', 'years_employed', 'annual_income'])
continuous_variable = ['loan_requested',
'loan_funded',
'investor_funded',
'debt_payment_ratio',
'derogatory_public_record_count',
'total_credit',
'credit_utilization_rate',
'total_credit_line_count',
'log_open_credit_line']
con_df = loan_data[continuous_variable]
#plotting the heatmap for correlation
ax = sns.heatmap(con_df.corr(), annot=True)
loan_data_df = loan_data.drop(columns = ['loan_funded', 'investor_funded'])
loan_data_df = pd.get_dummies(loan_data_df, columns = ['YE_categories',
'home_ownership',
'loan_grade',
'income_verification',
'initial_status',
'loan_category',
#'issued_month',
#'issued_year',
#'credit_earliest_month',
#'credit_earliest_year',
'is_36months',
'income_over_100000',
'region',
'has_delinquency',
'has_inquiry',
'derogatory_public_record'])
loan_data_df.head()
| interest_rate | loan_requested | annual_income | debt_payment_ratio | derogatory_public_record_count | total_credit | credit_utilization_rate | total_credit_line_count | issued_month | issued_year | credit_earliest_month | credit_earliest_year | log_open_credit_line | YE_categories_G1 | YE_categories_G2 | YE_categories_G3 | home_ownership_MORTGAGE | home_ownership_OTHER | home_ownership_OWN | home_ownership_RENT | loan_grade_A | loan_grade_B | loan_grade_C | loan_grade_D | loan_grade_E | loan_grade_F | loan_grade_G | loan_grade_Other | income_verification_not verified | income_verification_verified | initial_status_f | initial_status_w | loan_category_car | loan_category_credit_card | loan_category_debt_consolidation | loan_category_educational | loan_category_home_improvement | loan_category_house | loan_category_major_purchase | loan_category_medical | loan_category_moving | loan_category_other | loan_category_renewable_energy | loan_category_small_business | loan_category_vacation | loan_category_wedding | is_36months_0 | income_over_100000_0 | income_over_100000_1 | region_Midwest | region_Northeast | region_Other | region_South | region_West | has_delinquency_0 | has_delinquency_1 | has_inquiry_0 | has_inquiry_1 | derogatory_public_record_0 | derogatory_public_record_1 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11.89 | 25000.0 | 85000.0 | 19.48 | 0.0 | 28854.0 | 52.1 | 42.0 | 8 | 2009 | 2 | 1994 | 1.000000 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 |
| 1 | 10.71 | 7000.0 | 65000.0 | 14.29 | 0.0 | 33623.0 | 76.7 | 7.0 | 5 | 2008 | 10 | 2000 | 0.845098 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| 2 | 16.99 | 25000.0 | 70000.0 | 10.50 | 0.0 | 19878.0 | 66.3 | 17.0 | 8 | 2014 | 6 | 2000 | 1.000000 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| 3 | 13.11 | 1200.0 | 54000.0 | 5.47 | 0.0 | 2584.0 | 40.4 | 31.0 | 3 | 2010 | 1 | 1985 | 0.698970 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| 4 | 13.57 | 10800.0 | 32000.0 | 11.63 | 0.0 | 3511.0 | 25.6 | 40.0 | 11 | 2009 | 12 | 1996 | 1.146128 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 |
loan_data_df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 317177 entries, 0 to 399999 Data columns (total 60 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 interest_rate 317177 non-null float64 1 loan_requested 317177 non-null float64 2 annual_income 317177 non-null float64 3 debt_payment_ratio 317177 non-null float64 4 derogatory_public_record_count 317177 non-null float64 5 total_credit 317177 non-null float64 6 credit_utilization_rate 317177 non-null float64 7 total_credit_line_count 317177 non-null float64 8 issued_month 317177 non-null int64 9 issued_year 317177 non-null int64 10 credit_earliest_month 317177 non-null int64 11 credit_earliest_year 317177 non-null int64 12 log_open_credit_line 317177 non-null float64 13 YE_categories_G1 317177 non-null uint8 14 YE_categories_G2 317177 non-null uint8 15 YE_categories_G3 317177 non-null uint8 16 home_ownership_MORTGAGE 317177 non-null uint8 17 home_ownership_OTHER 317177 non-null uint8 18 home_ownership_OWN 317177 non-null uint8 19 home_ownership_RENT 317177 non-null uint8 20 loan_grade_A 317177 non-null uint8 21 loan_grade_B 317177 non-null uint8 22 loan_grade_C 317177 non-null uint8 23 loan_grade_D 317177 non-null uint8 24 loan_grade_E 317177 non-null uint8 25 loan_grade_F 317177 non-null uint8 26 loan_grade_G 317177 non-null uint8 27 loan_grade_Other 317177 non-null uint8 28 income_verification_not verified 317177 non-null uint8 29 income_verification_verified 317177 non-null uint8 30 initial_status_f 317177 non-null uint8 31 initial_status_w 317177 non-null uint8 32 loan_category_car 317177 non-null uint8 33 loan_category_credit_card 317177 non-null uint8 34 loan_category_debt_consolidation 317177 non-null uint8 35 loan_category_educational 317177 non-null uint8 36 loan_category_home_improvement 317177 non-null uint8 37 loan_category_house 317177 non-null uint8 38 loan_category_major_purchase 317177 non-null uint8 39 loan_category_medical 317177 non-null uint8 40 loan_category_moving 317177 non-null uint8 41 loan_category_other 317177 non-null uint8 42 loan_category_renewable_energy 317177 non-null uint8 43 loan_category_small_business 317177 non-null uint8 44 loan_category_vacation 317177 non-null uint8 45 loan_category_wedding 317177 non-null uint8 46 is_36months_0 317177 non-null uint8 47 income_over_100000_0 317177 non-null uint8 48 income_over_100000_1 317177 non-null uint8 49 region_Midwest 317177 non-null uint8 50 region_Northeast 317177 non-null uint8 51 region_Other 317177 non-null uint8 52 region_South 317177 non-null uint8 53 region_West 317177 non-null uint8 54 has_delinquency_0 317177 non-null uint8 55 has_delinquency_1 317177 non-null uint8 56 has_inquiry_0 317177 non-null uint8 57 has_inquiry_1 317177 non-null uint8 58 derogatory_public_record_0 317177 non-null uint8 59 derogatory_public_record_1 317177 non-null uint8 dtypes: float64(9), int64(4), uint8(47) memory usage: 48.1 MB
loan_data_df = loan_data_df.drop(columns = ['credit_earliest_month', 'credit_earliest_year'])
loan_data_df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 317177 entries, 0 to 399999 Data columns (total 58 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 interest_rate 317177 non-null float64 1 loan_requested 317177 non-null float64 2 annual_income 317177 non-null float64 3 debt_payment_ratio 317177 non-null float64 4 derogatory_public_record_count 317177 non-null float64 5 total_credit 317177 non-null float64 6 credit_utilization_rate 317177 non-null float64 7 total_credit_line_count 317177 non-null float64 8 issued_month 317177 non-null int64 9 issued_year 317177 non-null int64 10 log_open_credit_line 317177 non-null float64 11 YE_categories_G1 317177 non-null uint8 12 YE_categories_G2 317177 non-null uint8 13 YE_categories_G3 317177 non-null uint8 14 home_ownership_MORTGAGE 317177 non-null uint8 15 home_ownership_OTHER 317177 non-null uint8 16 home_ownership_OWN 317177 non-null uint8 17 home_ownership_RENT 317177 non-null uint8 18 loan_grade_A 317177 non-null uint8 19 loan_grade_B 317177 non-null uint8 20 loan_grade_C 317177 non-null uint8 21 loan_grade_D 317177 non-null uint8 22 loan_grade_E 317177 non-null uint8 23 loan_grade_F 317177 non-null uint8 24 loan_grade_G 317177 non-null uint8 25 loan_grade_Other 317177 non-null uint8 26 income_verification_not verified 317177 non-null uint8 27 income_verification_verified 317177 non-null uint8 28 initial_status_f 317177 non-null uint8 29 initial_status_w 317177 non-null uint8 30 loan_category_car 317177 non-null uint8 31 loan_category_credit_card 317177 non-null uint8 32 loan_category_debt_consolidation 317177 non-null uint8 33 loan_category_educational 317177 non-null uint8 34 loan_category_home_improvement 317177 non-null uint8 35 loan_category_house 317177 non-null uint8 36 loan_category_major_purchase 317177 non-null uint8 37 loan_category_medical 317177 non-null uint8 38 loan_category_moving 317177 non-null uint8 39 loan_category_other 317177 non-null uint8 40 loan_category_renewable_energy 317177 non-null uint8 41 loan_category_small_business 317177 non-null uint8 42 loan_category_vacation 317177 non-null uint8 43 loan_category_wedding 317177 non-null uint8 44 is_36months_0 317177 non-null uint8 45 income_over_100000_0 317177 non-null uint8 46 income_over_100000_1 317177 non-null uint8 47 region_Midwest 317177 non-null uint8 48 region_Northeast 317177 non-null uint8 49 region_Other 317177 non-null uint8 50 region_South 317177 non-null uint8 51 region_West 317177 non-null uint8 52 has_delinquency_0 317177 non-null uint8 53 has_delinquency_1 317177 non-null uint8 54 has_inquiry_0 317177 non-null uint8 55 has_inquiry_1 317177 non-null uint8 56 derogatory_public_record_0 317177 non-null uint8 57 derogatory_public_record_1 317177 non-null uint8 dtypes: float64(9), int64(2), uint8(47) memory usage: 43.3 MB
X, y = loan_data_df.drop("interest_rate", axis=1), loan_data_df["interest_rate"]
# Set seed for reproducibility
SEED = 1
# Split dataset into 70% train and 30% test
X_train, X_test, y_train, y_test = \
train_test_split(X, y,
test_size=0.2,
random_state=SEED)
# Feature selection by ExtraTreesRegressor(model based)
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score as acc
reg= ExtraTreesRegressor()
reg.fit(X_train,y_train)
ExtraTreesRegressor()
feat_importances = pd.Series(reg.feature_importances_, index=X_train.columns)
feat_importances = feat_importances.sort_values()
feat_importances.nlargest(10).plot(kind='barh')
plt.show()
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)
y_pred_lr = lr_model.predict(X_test)
lr_model.score(X_train,y_train)
0.8019623655764111
lr_model.score(X_test,y_test)
0.8010978701749751
print(f"intercept: {lr_model.intercept_}")
print(f"slope: {lr_model.coef_}")
intercept: -339.7272909588108 slope: [ 3.52833426e-05 -1.11312339e-06 1.28592096e-02 3.50818215e-02 -2.32571264e-05 2.37428907e-02 -1.56538666e-02 1.31611364e-02 1.76079131e-01 9.07423987e-01 -6.05085142e-02 1.59206121e-02 4.45879021e-02 -1.34641621e-01 -2.39709113e-02 8.41243880e-02 7.44881440e-02 -8.28557352e+00 -4.78924590e+00 -1.99983212e+00 6.80154087e-01 3.39341101e+00 6.16166995e+00 7.52174087e+00 -2.68232437e+00 -2.00093985e-01 2.00093985e-01 1.24880409e-01 -1.24880409e-01 -3.46867066e-01 -6.74111792e-01 -3.24679936e-01 -3.35401868e-01 -2.29655891e-01 1.58840230e-01 -2.89338518e-01 3.79415031e-01 4.49756051e-01 4.20935081e-01 9.73173441e-02 2.14753689e-01 3.78614840e-01 1.00422804e-01 -6.66133815e-16 -9.33475291e-02 9.33475291e-02 -6.01882516e-03 -9.22060256e-03 2.06086534e-02 8.03336260e-03 -1.34025882e-02 -1.50029532e-01 1.50029532e-01 -3.06511626e-01 3.06511626e-01 -7.85082667e-02 7.85082667e-02]
# Compute mse_dt
mse_lr = MSE(y_test, y_pred_lr)
# Compute rmse_dt
rmse_lr = mse_lr**(1/2)
from sklearn.model_selection import cross_val_score
# Compute the array containing the 10-folds CV MSEs
MSE_CV_scores = - cross_val_score(lr_model, X_train, y_train, cv=10,
scoring='neg_mean_squared_error',
n_jobs=-1)
# Compute the 10-folds CV RMSE
RMSE_CV = (MSE_CV_scores.mean())**(1/2)
# Import mean_squared_error from sklearn.metrics as MSE
from sklearn.metrics import mean_squared_error as MSE
# Fit dt to the training set
lr_model.fit(X_train, y_train)
# Predict the labels of the training set
y_pred_train = lr_model.predict(X_train)
# Evaluate the training set RMSE of dt
RMSE_train = (MSE(y_train, y_pred_train))**(1/2)
# Print rmse_dt
print("Test set RMSE of dt: {:.2f}".format(rmse_lr))
# Print RMSE_CV
print('CV RMSE: {:.2f}'.format(RMSE_CV))
# Print RMSE_train
print('Train RMSE: {:.2f}'.format(RMSE_train))
Test set RMSE of dt: 1.95 CV RMSE: 1.94 Train RMSE: 1.94
# Instantiate dt
dt = DecisionTreeRegressor(max_depth=8,
min_samples_leaf=0.13,
random_state=3)
# Fit dt to the training set
dt.fit(X_train, y_train)
y_pred = dt.predict(X_test)
# Compute mse_dt
mse_dt = MSE(y_test, y_pred)
# Compute rmse_dt
rmse_dt = mse_dt**(1/2)
dt.score(X_train,y_train)
0.7186956687942803
dt.score(X_test,y_test)
0.7189822495817428
from sklearn.model_selection import cross_val_score
# Compute the array containing the 10-folds CV MSEs
MSE_CV_scores = - cross_val_score(dt, X_train, y_train, cv=10,
scoring='neg_mean_squared_error',
n_jobs=-1)
# Compute the 10-folds CV RMSE
RMSE_CV = (MSE_CV_scores.mean())**(1/2)
# Import mean_squared_error from sklearn.metrics as MSE
from sklearn.metrics import mean_squared_error as MSE
# Fit dt to the training set
dt.fit(X_train, y_train)
# Predict the labels of the training set
y_pred_train = dt.predict(X_train)
# Evaluate the training set RMSE of dt
RMSE_train = (MSE(y_train, y_pred_train))**(1/2)
# Print rmse_dt
print("Test set RMSE of dt: {:.2f}".format(rmse_dt))
# Print RMSE_CV
print('CV RMSE: {:.2f}'.format(RMSE_CV))
# Print RMSE_train
print('Train RMSE: {:.2f}'.format(RMSE_train))
Test set RMSE of dt: 2.32 CV RMSE: 2.31 Train RMSE: 2.31
# reg_decision_model=DecisionTreeRegressor()
# # fit independent varaibles to the dependent variables
# reg_decision_model.fit(X_train,y_train)
parameters={"splitter":["best","random"],
"max_depth" : [1,3,5,7,9,11,12],
"min_samples_leaf":[1,2,3,4,5,6,7,8,9,10],
"min_weight_fraction_leaf":[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9],
"max_features":["auto","log2","sqrt",None],
"max_leaf_nodes":[None,10,20,30,40,50,60,70,80,90] }
from sklearn.model_selection import GridSearchCV
tuning_model=GridSearchCV(dt,
param_grid=parameters,
scoring='neg_mean_squared_error',
cv=3,verbose=3)
# # Hyper parameters range intialization for tuning
tuning_model.get_params
<bound method BaseEstimator.get_params of GridSearchCV(cv=3,
estimator=DecisionTreeRegressor(max_depth=8, min_samples_leaf=0.13,
random_state=3),
param_grid={'max_depth': [1, 3, 5, 7, 9, 11, 12],
'max_features': ['auto', 'log2', 'sqrt', None],
'max_leaf_nodes': [None, 10, 20, 30, 40, 50, 60, 70,
80, 90],
'min_samples_leaf': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'min_weight_fraction_leaf': [0.1, 0.2, 0.3, 0.4, 0.5,
0.6, 0.7, 0.8, 0.9],
'splitter': ['best', 'random']},
scoring='neg_mean_squared_error', verbose=3)>
# dir(GridSearchCV)
tuned_hyper_model= DecisionTreeRegressor(max_depth= 10,
max_features='auto',
max_leaf_nodes=50,
min_samples_leaf=2,
min_weight_fraction_leaf=0.05,
splitter='random')
tuned_hyper_model.fit(X_train,y_train)
DecisionTreeRegressor(max_depth=10, max_features='auto', max_leaf_nodes=50,
min_samples_leaf=2, min_weight_fraction_leaf=0.05,
splitter='random')
# prediction
tuned_pred=tuned_hyper_model.predict(X_test)
tuned_hyper_model.score(X_train,y_train)
0.7378913322230546
tuned_hyper_model.score(X_test,y_test)
0.736550612644119
# Compute mse_dt
mse_ht = MSE(y_test, tuned_pred)
# Compute rmse_dt
rmse_ht = mse_dt**(1/2)
from sklearn.model_selection import cross_val_score
# Compute the array containing the 10-folds CV MSEs
MSE_CV_scores = - cross_val_score(tuned_hyper_model, X_train, y_train, cv=10,
scoring='neg_mean_squared_error',
n_jobs=-1)
# Compute the 10-folds CV RMSE
RMSE_CV = (MSE_CV_scores.mean())**(1/2)
# Import mean_squared_error from sklearn.metrics as MSE
from sklearn.metrics import mean_squared_error as MSE
# Fit dt to the training set
dt.fit(X_train, y_train)
# Predict the labels of the training set
tuned_pred_train = tuned_hyper_model.predict(X_train)
# Evaluate the training set RMSE of dt
RMSE_train = (MSE(y_train, tuned_pred_train))**(1/2)
# Print rmse_dt
print("Test set RMSE of dt: {:.2f}".format(rmse_ht))
# Print RMSE_CV
print('CV RMSE: {:.2f}'.format(RMSE_CV))
# Print RMSE_train
print('Train RMSE: {:.2f}'.format(RMSE_train))
Test set RMSE of dt: 2.32 CV RMSE: 2.23 Train RMSE: 2.23
# # With hyperparameter tuned
# print('MAE:', metrics.mean_absolute_error(y_test,tuned_pred))
# print('MSE:', metrics.mean_squared_error(y_test, tuned_pred))
# print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, tuned_pred)))
rf = RandomForestRegressor(random_state = 42)
from pprint import pprint
# Look at parameters used by our current forest
print('Parameters currently in use:\n')
pprint(rf.get_params())
Parameters currently in use:
{'bootstrap': True,
'ccp_alpha': 0.0,
'criterion': 'squared_error',
'max_depth': None,
'max_features': 'auto',
'max_leaf_nodes': None,
'max_samples': None,
'min_impurity_decrease': 0.0,
'min_samples_leaf': 1,
'min_samples_split': 2,
'min_weight_fraction_leaf': 0.0,
'n_estimators': 100,
'n_jobs': None,
'oob_score': False,
'random_state': 42,
'verbose': 0,
'warm_start': False}
# Fit 'rf' to the training set
rf.fit(X_train, y_train)
# predict the train set labels
y_pred_train=rf.predict(X_train)
# Predict the test set labels 'y_pred_test'
y_pred_test = rf.predict(X_test)
# Evaluate the train set RMSE
rmse_train = MSE(y_train, y_pred_train)**(1/2)
# Evaluate the test set RMSE
rmse_test = MSE(y_test, y_pred)**(1/2)
# Compute the array containing the 10-folds CV MSEs
MSE_CV_scores = - cross_val_score(rf, X_train, y_train, cv=10,
scoring='neg_mean_squared_error',
n_jobs=-1)
# Compute the 10-folds CV RMSE
RMSE_CV = (MSE_CV_scores.mean())**(1/2)
# Print rmse_dt
print("Test set RMSE of rf: {:.2f}".format(rmse_test))
# Print RMSE_CV
print('CV RMSE: {:.2f}'.format(RMSE_CV))
# Print RMSE_train
print('Train RMSE: {:.2f}'.format(RMSE_train))
Test set RMSE of rf: 2.32 CV RMSE: 1.58 Train RMSE: 2.23
rf.score(X_train,y_train)
0.9815236633541242
rf.score(X_train,y_train)
0.9815236633541242
# Parameters currently in use:
# {'bootstrap': True,
# 'criterion': 'mse',
# 'max_depth': None,
# 'max_features': 'auto',
# 'max_leaf_nodes': None,
# 'min_impurity_decrease': 0.0,
# 'min_impurity_split': None,
# 'min_samples_leaf': 1,
# 'min_samples_split': 2,
# 'min_weight_fraction_leaf': 0.0,
# 'n_estimators': 10,
# 'n_jobs': 1,
# 'oob_score': False,
# 'random_state': 42,
# 'verbose': 0,
# 'warm_start': False}
from sklearn.model_selection import RandomizedSearchCV
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
'max_features': max_features,
'max_depth': max_depth,
'min_samples_split': min_samples_split,
'min_samples_leaf': min_samples_leaf,
'bootstrap': bootstrap}
print(random_grid)
{'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000], 'max_features': ['auto', 'sqrt'], 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, None], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 2, 4], 'bootstrap': [True, False]}
# Instantiate a random forests regressor 'rf' 400 estimators
SEED = 1
rf = RandomForestRegressor(n_estimators=400,
min_samples_leaf=0.12,
random_state=SEED)
# Fit 'rf' to the training set
rf.fit(X_train, y_train)
# predict the train set labels
y_pred_train=rf.predict(X_train)
# Predict the test set labels 'y_pred_test'
y_pred_test = rf.predict(X_test)
# Evaluate the train set RMSE
rmse_train = MSE(y_train, y_pred_train)**(1/2)
# Evaluate the test set RMSE
rmse_test = MSE(y_test, y_pred)**(1/2)
# Compute the array containing the 10-folds CV MSEs
MSE_CV_scores = - cross_val_score(rf, X_train, y_train, cv=10,
scoring='neg_mean_squared_error',
n_jobs=-1)
# Compute the 10-folds CV RMSE
RMSE_CV = (MSE_CV_scores.mean())**(1/2)
# prediction
rf.score(X_train,y_train)
0.22021748614548964
rf.score(X_test,y_test)
0.21763652870954786
# Print rmse_dt
print("Test set RMSE of rf: {:.2f}".format(rmse_test))
# Print RMSE_CV
print('CV RMSE: {:.2f}'.format(RMSE_CV))
# Print RMSE_train
print('Train RMSE: {:.2f}'.format(RMSE_train))
Test set RMSE of rf: 2.32 CV RMSE: 3.85 Train RMSE: 2.23
# Instantiate gb
gb = GradientBoostingRegressor(max_depth=4,
n_estimators=200,
random_state=2)
# Fit gb to the training set
gb.fit(X_train, y_train)
# predict the train set labels
y_pred_gb_train=gb.predict(X_train)
# Predict test set labels
y_pred_gb_test = gb.predict(X_test)
# Evaluate the train set RMSE
rmse_train = MSE(y_train, y_pred_gb_train)**(1/2)
# Evaluate the test set RMSE
rmse_test = MSE(y_test, y_pred_gb_test)**(1/2)
# Compute the array containing the 10-folds CV MSEs
MSE_CV_scores = - cross_val_score(gb, X_train, y_train, cv=10,
scoring='neg_mean_squared_error',
n_jobs=-1)
# Compute the 10-folds CV RMSE
RMSE_CV = (MSE_CV_scores.mean())**(1/2)
# Print rmse_dt
print("Test set RMSE of rf: {:.2f}".format(rmse_test))
# Print RMSE_CV
print('CV RMSE: {:.2f}'.format(RMSE_CV))
# Print RMSE_train
print('Train RMSE: {:.2f}'.format(RMSE_train))
Test set RMSE of rf: 1.58 CV RMSE: 1.57 Train RMSE: 2.23
# predict on
gb.score(X_train,y_train)
0.8730305534626144
gb.score(X_test,y_test)
0.8689728770713843
I applied linear regression, decision tree, random forest and gradient boosting and then fine tune the hyperparameter to address overfitting/underfitting problem and to acheive better results.
Linear regression: is used when the output variable is continuous variable. Here, I used it as a baseline model to compare with other models. The score(R square) for traning and testing results are about the same(80%) and the test(1.95), CV(1.94) and train(1.94) rmse are around the same. This model performs well and has no obvious overfitting or underfitting problem.
Decision tree: I used max_depth = 8, min_samples_leaf = 0.13, random_state=3 to instantiate decision tree. The R square for both training and testing data is 72%, which is lower than using linear regression. Also, the test(2.32), CV(2.31), and train(2.31) RMSE are around the same. Therefore, there is no obvious overfitting and underfitting problem. I then tuned the model with max_depth = 10, max_leaf_nodes = 50, min_sample_leaf = 2 to see if it improves the performance. The R-square for both training and testing is 74%. The train(2.22) and CV(2.23) RMSE are around the same yet the test set RMSE is slightly higher(2.32), showing the sign of overfitting. Even though the R square increases 2%, considering it showing the sign of overfitting, the previous DT model is better.
Random forest: I used the default random forest function and the R-square for both training and testing are around 98%. The explanatory power of predictors on target variable is high. The test set of rmse(2.32) is slightly higher than the train rmse(2.22) and the CV is lower(1.58). I tuned the hyperparameter and set the n_estimators=400, min_samples_leaf=0.12. R-square becomes very low(22%) and has overfitting problem(CV 3.85 > test 2.32 > train 2.22). The previous random forset with 98% of explanatory power performs better.
Gradient boosting: The R-square score of training and testing are 87%, which is also a good model. Nevertheless, the CV rmse(1.57) < test set rmse(1.58) < training rmsa(2.22). This model has slightly underfitting problem.
From above, since linear regression model performs well (r square = 80%) with no overfitting and underfitting problem, I will use it to predict the interest rate. In addition, random forest has 98% of explanatory problem with no serious overfitting problem.I will also use it to predict the interest rate. To improve model performance, will try to do clustering techniques on job title and loan title and include them as predictors in the future.
test_data = pd.read_csv("Holdout for Testing.csv")
test_data.head()
| X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 | X10 | X11 | X12 | X13 | X14 | X15 | X16 | X17 | X18 | X19 | X20 | X21 | X22 | X23 | X24 | X25 | X26 | X27 | X28 | X29 | X30 | X31 | X32 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | 44409194 | 47416907 | $6,000 | $6,000 | $6,000 | 36 months | C | C5 | Electrician | 10+ years | MORTGAGE | 68000.0 | VERIFIED - income | 15-Mar | NaN | debt_consolidation | Debt consolidation | 430xx | OH | 28.31 | 0 | 2-Nov | 1 | 26.0 | NaN | 18 | 0 | 19861 | 64.50% | 33 | f |
| 1 | NaN | 44017917 | 47034722 | $24,000 | $24,000 | $24,000 | 36 months | A | A1 | Executive Assistant | 8 years | RENT | 110480.0 | VERIFIED - income | 15-Mar | NaN | debt_consolidation | Debt consolidation | 945xx | CA | 16.03 | 0 | Dec-68 | 1 | NaN | NaN | 12 | 0 | 17001 | 26.20% | 36 | w |
| 2 | NaN | 44259158 | 47306871 | $35,000 | $35,000 | $35,000 | 36 months | C | C2 | District Sales Leader | 10+ years | MORTGAGE | 86000.0 | VERIFIED - income | 15-Mar | NaN | debt_consolidation | Debt consolidation | 195xx | PA | 32.49 | 0 | Oct-98 | 0 | NaN | NaN | 16 | 0 | 25797 | 49.90% | 33 | w |
| 3 | NaN | 44429213 | 47476932 | $10,000 | $10,000 | $10,000 | 60 months | D | D1 | pharmacy associate | 10+ years | RENT | 30000.0 | VERIFIED - income source | 15-Mar | NaN | debt_consolidation | Debt consolidation | 921xx | CA | 32.96 | 0 | Feb-99 | 1 | NaN | 114.0 | 13 | 1 | 9586 | 43.80% | 21 | w |
| 4 | NaN | 44299188 | 47346901 | $24,000 | $24,000 | $24,000 | 60 months | B | B1 | Medical case manager | 10+ years | MORTGAGE | 82500.0 | VERIFIED - income | 15-Mar | NaN | debt_consolidation | Debt consolidation | 196xx | PA | 31.03 | 0 | 2-Dec | 0 | 48.0 | NaN | 27 | 0 | 31842 | 41.30% | 43 | w |
test_data.shape
(80000, 32)
test_data = test_data.drop("X1", axis=1)
test_data.columns = [#'interest_rate',
'loan_id',
'borrower_id',
'loan_requested',
'loan_funded',
'investor_funded',
'payment_numbers',
'loan_grade',
'loan_subgrade',
'job_title',
'years_employed',
'home_ownership',
'annual_income',
'income_verification',
'issued_date',
'loan_reason',
'loan_category',
'loan_title',
'zip_code',
'state',
'debt_payment_ratio',
'delinquency_frequency',
'credit_earliest_date',
'creditor_inquiries',
'borrower_delinquency_recency',
'public_record_recency',
'open_credit_line_count',
'derogatory_public_record_count',
'total_credit',
'credit_utilization_rate',
'total_credit_line_count',
'initial_status'
]
test_data.head()
| loan_id | borrower_id | loan_requested | loan_funded | investor_funded | payment_numbers | loan_grade | loan_subgrade | job_title | years_employed | home_ownership | annual_income | income_verification | issued_date | loan_reason | loan_category | loan_title | zip_code | state | debt_payment_ratio | delinquency_frequency | credit_earliest_date | creditor_inquiries | borrower_delinquency_recency | public_record_recency | open_credit_line_count | derogatory_public_record_count | total_credit | credit_utilization_rate | total_credit_line_count | initial_status | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 44409194 | 47416907 | $6,000 | $6,000 | $6,000 | 36 months | C | C5 | Electrician | 10+ years | MORTGAGE | 68000.0 | VERIFIED - income | 15-Mar | NaN | debt_consolidation | Debt consolidation | 430xx | OH | 28.31 | 0 | 2-Nov | 1 | 26.0 | NaN | 18 | 0 | 19861 | 64.50% | 33 | f |
| 1 | 44017917 | 47034722 | $24,000 | $24,000 | $24,000 | 36 months | A | A1 | Executive Assistant | 8 years | RENT | 110480.0 | VERIFIED - income | 15-Mar | NaN | debt_consolidation | Debt consolidation | 945xx | CA | 16.03 | 0 | Dec-68 | 1 | NaN | NaN | 12 | 0 | 17001 | 26.20% | 36 | w |
| 2 | 44259158 | 47306871 | $35,000 | $35,000 | $35,000 | 36 months | C | C2 | District Sales Leader | 10+ years | MORTGAGE | 86000.0 | VERIFIED - income | 15-Mar | NaN | debt_consolidation | Debt consolidation | 195xx | PA | 32.49 | 0 | Oct-98 | 0 | NaN | NaN | 16 | 0 | 25797 | 49.90% | 33 | w |
| 3 | 44429213 | 47476932 | $10,000 | $10,000 | $10,000 | 60 months | D | D1 | pharmacy associate | 10+ years | RENT | 30000.0 | VERIFIED - income source | 15-Mar | NaN | debt_consolidation | Debt consolidation | 921xx | CA | 32.96 | 0 | Feb-99 | 1 | NaN | 114.0 | 13 | 1 | 9586 | 43.80% | 21 | w |
| 4 | 44299188 | 47346901 | $24,000 | $24,000 | $24,000 | 60 months | B | B1 | Medical case manager | 10+ years | MORTGAGE | 82500.0 | VERIFIED - income | 15-Mar | NaN | debt_consolidation | Debt consolidation | 196xx | PA | 31.03 | 0 | 2-Dec | 0 | 48.0 | NaN | 27 | 0 | 31842 | 41.30% | 43 | w |
test_data.isna().sum()
loan_id 0 borrower_id 0 loan_requested 0 loan_funded 0 investor_funded 0 payment_numbers 0 loan_grade 0 loan_subgrade 0 job_title 4394 years_employed 4382 home_ownership 0 annual_income 0 income_verification 0 issued_date 0 loan_reason 79985 loan_category 0 loan_title 0 zip_code 0 state 0 debt_payment_ratio 0 delinquency_frequency 0 credit_earliest_date 0 creditor_inquiries 0 borrower_delinquency_recency 38704 public_record_recency 66161 open_credit_line_count 0 derogatory_public_record_count 0 total_credit 0 credit_utilization_rate 30 total_credit_line_count 0 initial_status 0 dtype: int64
# drop na in target variable
#test_data = test_data.dropna(subset=['interest_rate'])
# drop variables that are not predictive to target variable
test_data = test_data.drop(columns = ['loan_reason', 'borrower_delinquency_recency', 'public_record_recency'])
test_data = test_data.drop(columns = ['loan_id', 'borrower_id', 'loan_subgrade', 'zip_code'])
test_data.isna().sum()
loan_requested 0 loan_funded 0 investor_funded 0 payment_numbers 0 loan_grade 0 job_title 4394 years_employed 4382 home_ownership 0 annual_income 0 income_verification 0 issued_date 0 loan_category 0 loan_title 0 state 0 debt_payment_ratio 0 delinquency_frequency 0 credit_earliest_date 0 creditor_inquiries 0 open_credit_line_count 0 derogatory_public_record_count 0 total_credit 0 credit_utilization_rate 30 total_credit_line_count 0 initial_status 0 dtype: int64
# issued_date
test_data['issued_month'] = test_data['issued_date'].apply(lambda x: str(x)[-3:])
test_data['issued_year'] = test_data['issued_date'].apply(lambda x: str(x)[:2])
# change issued_date & credit_earliest_date's Month & Year
test_data['issued_month'] = test_data['issued_month'].apply(lambda x: mtn(x))
test_data['issued_year'] = test_data['issued_year'].apply(lambda x: year(x))
# test_data['credit_earliest_date'].unique()
test_data.head()
| loan_requested | loan_funded | investor_funded | payment_numbers | loan_grade | job_title | years_employed | home_ownership | annual_income | income_verification | issued_date | loan_category | loan_title | state | debt_payment_ratio | delinquency_frequency | credit_earliest_date | creditor_inquiries | open_credit_line_count | derogatory_public_record_count | total_credit | credit_utilization_rate | total_credit_line_count | initial_status | issued_month | issued_year | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | $6,000 | $6,000 | $6,000 | 36 months | C | Electrician | 10+ years | MORTGAGE | 68000.0 | VERIFIED - income | 15-Mar | debt_consolidation | Debt consolidation | OH | 28.31 | 0 | 2-Nov | 1 | 18 | 0 | 19861 | 64.50% | 33 | f | 3 | 2015 |
| 1 | $24,000 | $24,000 | $24,000 | 36 months | A | Executive Assistant | 8 years | RENT | 110480.0 | VERIFIED - income | 15-Mar | debt_consolidation | Debt consolidation | CA | 16.03 | 0 | Dec-68 | 1 | 12 | 0 | 17001 | 26.20% | 36 | w | 3 | 2015 |
| 2 | $35,000 | $35,000 | $35,000 | 36 months | C | District Sales Leader | 10+ years | MORTGAGE | 86000.0 | VERIFIED - income | 15-Mar | debt_consolidation | Debt consolidation | PA | 32.49 | 0 | Oct-98 | 0 | 16 | 0 | 25797 | 49.90% | 33 | w | 3 | 2015 |
| 3 | $10,000 | $10,000 | $10,000 | 60 months | D | pharmacy associate | 10+ years | RENT | 30000.0 | VERIFIED - income source | 15-Mar | debt_consolidation | Debt consolidation | CA | 32.96 | 0 | Feb-99 | 1 | 13 | 1 | 9586 | 43.80% | 21 | w | 3 | 2015 |
| 4 | $24,000 | $24,000 | $24,000 | 60 months | B | Medical case manager | 10+ years | MORTGAGE | 82500.0 | VERIFIED - income | 15-Mar | debt_consolidation | Debt consolidation | PA | 31.03 | 0 | 2-Dec | 0 | 27 | 0 | 31842 | 41.30% | 43 | w | 3 | 2015 |
test_data = test_data.drop(['issued_date', 'credit_earliest_date' ], axis=1)
test_data.shape
(80000, 24)
for col in ['loan_requested', 'loan_funded', 'investor_funded']:
test_data[col] = test_data[col].str.slice(start=1).str.replace(",","").astype("float")
for col in ['credit_utilization_rate']:
test_data[col] = test_data[col].str.slice(stop=-1).astype("float")
test_data['loan_requested'].fillna(test_data['loan_requested'].median(), inplace=True)
test_data['loan_funded'].fillna(test_data['loan_funded'].median(), inplace=True)
test_data['investor_funded'].fillna(test_data['investor_funded'].median(), inplace=True)
test_data['is_36months'] = test_data['payment_numbers'].apply(lambda x: 1 if x == '36 months' else 0)
test_data = test_data.drop('payment_numbers', axis=1)
test_data.loan_grade = test_data.loan_grade.fillna('Other')
test_data.job_title = test_data.job_title.fillna('Other')
test_data = test_data.drop('job_title', axis=1)
test_data['years_employed'] = test_data['years_employed'].str.replace("< 1", "0").replace("+", "")
test_data['years_employed'] = test_data['years_employed'].str.strip()
test_data['years_employed'] = test_data['years_employed'].str.slice(stop=2)
test_data['years_employed'] = test_data['years_employed'].astype('float')
test_data['years_employed'] = test_data['years_employed'].interpolate()
year_condition = [(test_data['years_employed'] >= 0) & (test_data['years_employed'] < 6),
(test_data['years_employed'] >= 6) & (test_data['years_employed'] < 10),
test_data['years_employed'] >= 10]
test_data.shape
(80000, 23)
year_em_categories = ['G1', 'G2', 'G3']
test_data['YE_categories'] = np.select(year_condition, year_em_categories)
test_data["YE_categories"] = test_data["YE_categories"].astype("category")
test_data = test_data.drop('years_employed', axis=1)
test_data["home_ownership"] = test_data["home_ownership"].str.replace("NONE", "OTHER").str.replace("ANY", "OTHER")
test_data["home_ownership"] = test_data["home_ownership"] .fillna("OTHER")
test_data.shape
(80000, 23)
test_data.annual_income = test_data.annual_income.interpolate()
income_condition = [(test_data['annual_income'] >= 0) & (test_data['annual_income'] < 100000),
(test_data['annual_income'] >= 100000) & (test_data['annual_income'] < 200000),
(test_data['annual_income'] >= 200000) & (test_data['annual_income'] < 300000),
(test_data['annual_income'] >= 300000) & (test_data['annual_income'] < 400000),
(test_data['annual_income'] >= 400000) & (test_data['annual_income'] < 500000),
(test_data['annual_income'] >= 500000) & (test_data['annual_income'] < 600000),
(test_data['annual_income'] >= 600000) & (test_data['annual_income'] < 700000),
test_data['annual_income'] >= 700000]
income_categories = ['G1', 'G2', 'G3', 'G4', 'G5', 'G6', 'G7', 'G8']
test_data['income_categories'] = np.select(income_condition, income_categories)
test_data['income_over_100000'] = test_data['income_categories'].apply(lambda x:1 if x =='G1' else 0)
# drop income categories
test_data = test_data.drop(columns = ['income_categories'])
test_data.shape
(80000, 24)
test_data["income_verification"] = test_data["income_verification"].str.replace("VERIFIED - income", "verified").str.replace("verified source","verified")
test_data["income_verification"] = test_data["income_verification"] .fillna("verified")
test_data = test_data.drop(columns = ['loan_title'])
test_data.shape
(80000, 23)
states = {
"AK": "Other",
"AL": "South",
"AR": "South",
"AS": "Other",
"AZ": "West",
"CA": "West",
"CO": "West",
"CT": "Northeast",
"DC": "Northeast",
"DE": "Northeast",
"FL": "South",
"GA": "South",
"GU": "Other",
"HI": "Other",
"IA": "Midwest",
"ID": "West",
"IL": "Midwest",
"IN": "Midwest",
"KS": "Midwest",
"KY": "South",
"LA": "South",
"MA": "Northeast",
"MD": "Northeast",
"ME": "Northeast",
"MI": "West",
"MN": "Midwest",
"MO": "Midwest",
"MP": "Other",
"MS": "South",
"MT": "West",
"NA": "Other",
"NC": "South",
"ND": "Midwest",
"NE": "West",
"NH": "Northeast",
"NJ": "Northeast",
"NM": "West",
"NV": "West",
"NY": "Northeast",
"OH": "Midwest",
"OK": "South",
"OR": "West",
"PA": "Northeast",
"PR": "Other",
"RI": "Northeast",
"SC": "South",
"SD": "Midwest",
"TN": "South",
"TX": "South",
"UT": "West",
"VA": "South",
"VI": "Other",
"VT": "Northeast",
"WA": "West",
"WI": "Midwest",
"WV": "South",
"WY": "West",
}
test_data["region"] = test_data["state"].map(states)
test_data = test_data.drop("state", axis=1)
test_data.shape
(80000, 23)
test_data.head()
| loan_requested | loan_funded | investor_funded | loan_grade | home_ownership | annual_income | income_verification | loan_category | debt_payment_ratio | delinquency_frequency | creditor_inquiries | open_credit_line_count | derogatory_public_record_count | total_credit | credit_utilization_rate | total_credit_line_count | initial_status | issued_month | issued_year | is_36months | YE_categories | income_over_100000 | region | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6000.0 | 6000.0 | 6000.0 | C | MORTGAGE | 68000.0 | verified | debt_consolidation | 28.31 | 0 | 1 | 18 | 0 | 19861 | 64.5 | 33 | f | 3 | 2015 | 0 | G3 | 1 | Midwest |
| 1 | 24000.0 | 24000.0 | 24000.0 | A | RENT | 110480.0 | verified | debt_consolidation | 16.03 | 0 | 1 | 12 | 0 | 17001 | 26.2 | 36 | w | 3 | 2015 | 0 | G2 | 0 | West |
| 2 | 35000.0 | 35000.0 | 35000.0 | C | MORTGAGE | 86000.0 | verified | debt_consolidation | 32.49 | 0 | 0 | 16 | 0 | 25797 | 49.9 | 33 | w | 3 | 2015 | 0 | G3 | 1 | Northeast |
| 3 | 10000.0 | 10000.0 | 10000.0 | D | RENT | 30000.0 | verified | debt_consolidation | 32.96 | 0 | 1 | 13 | 1 | 9586 | 43.8 | 21 | w | 3 | 2015 | 0 | G3 | 1 | West |
| 4 | 24000.0 | 24000.0 | 24000.0 | B | MORTGAGE | 82500.0 | verified | debt_consolidation | 31.03 | 0 | 0 | 27 | 0 | 31842 | 41.3 | 43 | w | 3 | 2015 | 0 | G3 | 1 | Northeast |
test_data['has_inquiry'] = test_data['creditor_inquiries'].apply(lambda x: 1 if x != 0 else 0)
test_data.shape
(80000, 24)
test_data['log_open_credit_line'] = np.log10(test_data['open_credit_line_count'])
test_data['log_open_credit_line'].replace([-np.inf], 0, inplace = True)
test_data = test_data.drop(['open_credit_line_count'], axis=1)
test_data.shape
(80000, 24)
test_data['derogatory_public_record'] = test_data['derogatory_public_record_count'].apply(lambda x: 1 if x != 0 else 0)
#test_data = test_data[test_data['total_credit'] <= 40889]
test_data['credit_utilization_rate'] = test_data['credit_utilization_rate'].interpolate()
#test_data = test_data[test_data['credit_utilization_rate'] <= 127.4]
#test_data = test_data[test_data['total_credit_line_count'] <= 53]
test_data.shape
(80000, 25)
test_data['has_delinquency'] = test_data['delinquency_frequency'].apply(lambda x: 1 if x != 0 else 0)
test_data = test_data.drop(columns = ['delinquency_frequency'])
test_data.isna().sum()
loan_requested 0 loan_funded 0 investor_funded 0 loan_grade 0 home_ownership 0 annual_income 0 income_verification 0 loan_category 0 debt_payment_ratio 0 creditor_inquiries 0 derogatory_public_record_count 0 total_credit 0 credit_utilization_rate 0 total_credit_line_count 0 initial_status 0 issued_month 0 issued_year 0 is_36months 0 YE_categories 0 income_over_100000 0 region 0 has_inquiry 0 log_open_credit_line 0 derogatory_public_record 0 has_delinquency 0 dtype: int64
test_data.shape
(80000, 25)
loan_data.shape
(317177, 27)
test_data.head()
| loan_requested | loan_funded | investor_funded | loan_grade | home_ownership | annual_income | income_verification | loan_category | debt_payment_ratio | creditor_inquiries | derogatory_public_record_count | total_credit | credit_utilization_rate | total_credit_line_count | initial_status | issued_month | issued_year | is_36months | YE_categories | income_over_100000 | region | has_inquiry | log_open_credit_line | derogatory_public_record | has_delinquency | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6000.0 | 6000.0 | 6000.0 | C | MORTGAGE | 68000.0 | verified | debt_consolidation | 28.31 | 1 | 0 | 19861 | 64.5 | 33 | f | 3 | 2015 | 0 | G3 | 1 | Midwest | 1 | 1.255273 | 0 | 0 |
| 1 | 24000.0 | 24000.0 | 24000.0 | A | RENT | 110480.0 | verified | debt_consolidation | 16.03 | 1 | 0 | 17001 | 26.2 | 36 | w | 3 | 2015 | 0 | G2 | 0 | West | 1 | 1.079181 | 0 | 0 |
| 2 | 35000.0 | 35000.0 | 35000.0 | C | MORTGAGE | 86000.0 | verified | debt_consolidation | 32.49 | 0 | 0 | 25797 | 49.9 | 33 | w | 3 | 2015 | 0 | G3 | 1 | Northeast | 0 | 1.204120 | 0 | 0 |
| 3 | 10000.0 | 10000.0 | 10000.0 | D | RENT | 30000.0 | verified | debt_consolidation | 32.96 | 1 | 1 | 9586 | 43.8 | 21 | w | 3 | 2015 | 0 | G3 | 1 | West | 1 | 1.113943 | 1 | 0 |
| 4 | 24000.0 | 24000.0 | 24000.0 | B | MORTGAGE | 82500.0 | verified | debt_consolidation | 31.03 | 0 | 0 | 31842 | 41.3 | 43 | w | 3 | 2015 | 0 | G3 | 1 | Northeast | 0 | 1.431364 | 0 | 0 |
test_data_df = test_data.drop(columns = ['loan_funded', 'investor_funded'])
test_data_df.shape
(80000, 23)
loan_data_df.head()
| interest_rate | loan_requested | annual_income | debt_payment_ratio | derogatory_public_record_count | total_credit | credit_utilization_rate | total_credit_line_count | issued_month | issued_year | log_open_credit_line | YE_categories_G1 | YE_categories_G2 | YE_categories_G3 | home_ownership_MORTGAGE | home_ownership_OTHER | home_ownership_OWN | home_ownership_RENT | loan_grade_A | loan_grade_B | loan_grade_C | loan_grade_D | loan_grade_E | loan_grade_F | loan_grade_G | loan_grade_Other | income_verification_not verified | income_verification_verified | initial_status_f | initial_status_w | loan_category_car | loan_category_credit_card | loan_category_debt_consolidation | loan_category_educational | loan_category_home_improvement | loan_category_house | loan_category_major_purchase | loan_category_medical | loan_category_moving | loan_category_other | loan_category_renewable_energy | loan_category_small_business | loan_category_vacation | loan_category_wedding | is_36months_0 | income_over_100000_0 | income_over_100000_1 | region_Midwest | region_Northeast | region_Other | region_South | region_West | has_delinquency_0 | has_delinquency_1 | has_inquiry_0 | has_inquiry_1 | derogatory_public_record_0 | derogatory_public_record_1 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11.89 | 25000.0 | 85000.0 | 19.48 | 0.0 | 28854.0 | 52.1 | 42.0 | 8 | 2009 | 1.000000 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 |
| 1 | 10.71 | 7000.0 | 65000.0 | 14.29 | 0.0 | 33623.0 | 76.7 | 7.0 | 5 | 2008 | 0.845098 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| 2 | 16.99 | 25000.0 | 70000.0 | 10.50 | 0.0 | 19878.0 | 66.3 | 17.0 | 8 | 2014 | 1.000000 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| 3 | 13.11 | 1200.0 | 54000.0 | 5.47 | 0.0 | 2584.0 | 40.4 | 31.0 | 3 | 2010 | 0.698970 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| 4 | 13.57 | 10800.0 | 32000.0 | 11.63 | 0.0 | 3511.0 | 25.6 | 40.0 | 11 | 2009 | 1.146128 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 |
loan_data_df.home_ownership_OTHER.value_counts()
0 268431 1 48746 Name: home_ownership_OTHER, dtype: int64
loan_data_df.loan_category_educational.value_counts()
0 316916 1 261 Name: loan_category_educational, dtype: int64
316916/(316916+261)
0.9991771156168322
1-0.9991771156168322
0.000822884383167799
test_data_df.shape
(80000, 23)
test_data_df.head()
| loan_requested | loan_grade | home_ownership | annual_income | income_verification | loan_category | debt_payment_ratio | creditor_inquiries | derogatory_public_record_count | total_credit | credit_utilization_rate | total_credit_line_count | initial_status | issued_month | issued_year | is_36months | YE_categories | income_over_100000 | region | has_inquiry | log_open_credit_line | derogatory_public_record | has_delinquency | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6000.0 | C | MORTGAGE | 68000.0 | verified | debt_consolidation | 28.31 | 1 | 0 | 19861 | 64.5 | 33 | f | 3 | 2015 | 0 | G3 | 1 | Midwest | 1 | 1.255273 | 0 | 0 |
| 1 | 24000.0 | A | RENT | 110480.0 | verified | debt_consolidation | 16.03 | 1 | 0 | 17001 | 26.2 | 36 | w | 3 | 2015 | 0 | G2 | 0 | West | 1 | 1.079181 | 0 | 0 |
| 2 | 35000.0 | C | MORTGAGE | 86000.0 | verified | debt_consolidation | 32.49 | 0 | 0 | 25797 | 49.9 | 33 | w | 3 | 2015 | 0 | G3 | 1 | Northeast | 0 | 1.204120 | 0 | 0 |
| 3 | 10000.0 | D | RENT | 30000.0 | verified | debt_consolidation | 32.96 | 1 | 1 | 9586 | 43.8 | 21 | w | 3 | 2015 | 0 | G3 | 1 | West | 1 | 1.113943 | 1 | 0 |
| 4 | 24000.0 | B | MORTGAGE | 82500.0 | verified | debt_consolidation | 31.03 | 0 | 0 | 31842 | 41.3 | 43 | w | 3 | 2015 | 0 | G3 | 1 | Northeast | 0 | 1.431364 | 0 | 0 |
test_data_df = pd.get_dummies(test_data_df, columns = ['YE_categories',
'home_ownership',
'loan_grade',
'income_verification',
'initial_status',
'loan_category',
#'issued_month',
#'issued_year',
#'credit_earliest_month',
#'credit_earliest_year',
'is_36months',
'income_over_100000',
'region',
'has_delinquency',
'has_inquiry',
'derogatory_public_record'])
col = list(test_data_df.columns)
col2 = list(loan_data_df.columns)
for c in loan_data_df.columns:
if c not in test_data_df.columns:
print(c)
interest_rate home_ownership_OTHER loan_grade_Other loan_category_educational
# filling in the missing columns and fill in 0, 1 values according to the proportion in training set
test_data_df['home_ownership_OTHER'] = np.random.choice([0, 1], size=len(test_data_df), p=[0.84, 0.16])
test_data_df['loan_grade_Other'] = np.random.choice([0, 1], size=len(test_data_df), p=[0.85, 0.15])
test_data_df['loan_category_educational'] = np.random.choice([0, 1], size=len(test_data_df), p=[0.99, 0.01])
test_data_df.shape
(80000, 58)
loan_data_df.shape
(317177, 58)
test_data_df.columns
Index(['loan_requested', 'annual_income', 'debt_payment_ratio',
'creditor_inquiries', 'derogatory_public_record_count', 'total_credit',
'credit_utilization_rate', 'total_credit_line_count', 'issued_month',
'issued_year', 'log_open_credit_line', 'YE_categories_G1',
'YE_categories_G2', 'YE_categories_G3', 'home_ownership_MORTGAGE',
'home_ownership_OWN', 'home_ownership_RENT', 'loan_grade_A',
'loan_grade_B', 'loan_grade_C', 'loan_grade_D', 'loan_grade_E',
'loan_grade_F', 'loan_grade_G', 'income_verification_not verified',
'income_verification_verified', 'initial_status_f', 'initial_status_w',
'loan_category_car', 'loan_category_credit_card',
'loan_category_debt_consolidation', 'loan_category_home_improvement',
'loan_category_house', 'loan_category_major_purchase',
'loan_category_medical', 'loan_category_moving', 'loan_category_other',
'loan_category_renewable_energy', 'loan_category_small_business',
'loan_category_vacation', 'loan_category_wedding', 'is_36months_0',
'income_over_100000_0', 'income_over_100000_1', 'region_Midwest',
'region_Northeast', 'region_Other', 'region_South', 'region_West',
'has_delinquency_0', 'has_delinquency_1', 'has_inquiry_0',
'has_inquiry_1', 'derogatory_public_record_0',
'derogatory_public_record_1', 'home_ownership_OTHER',
'loan_grade_Other', 'loan_category_educational'],
dtype='object')
loan_data_df.columns
Index(['interest_rate', 'loan_requested', 'annual_income',
'debt_payment_ratio', 'derogatory_public_record_count', 'total_credit',
'credit_utilization_rate', 'total_credit_line_count', 'issued_month',
'issued_year', 'log_open_credit_line', 'YE_categories_G1',
'YE_categories_G2', 'YE_categories_G3', 'home_ownership_MORTGAGE',
'home_ownership_OTHER', 'home_ownership_OWN', 'home_ownership_RENT',
'loan_grade_A', 'loan_grade_B', 'loan_grade_C', 'loan_grade_D',
'loan_grade_E', 'loan_grade_F', 'loan_grade_G', 'loan_grade_Other',
'income_verification_not verified', 'income_verification_verified',
'initial_status_f', 'initial_status_w', 'loan_category_car',
'loan_category_credit_card', 'loan_category_debt_consolidation',
'loan_category_educational', 'loan_category_home_improvement',
'loan_category_house', 'loan_category_major_purchase',
'loan_category_medical', 'loan_category_moving', 'loan_category_other',
'loan_category_renewable_energy', 'loan_category_small_business',
'loan_category_vacation', 'loan_category_wedding', 'is_36months_0',
'income_over_100000_0', 'income_over_100000_1', 'region_Midwest',
'region_Northeast', 'region_Other', 'region_South', 'region_West',
'has_delinquency_0', 'has_delinquency_1', 'has_inquiry_0',
'has_inquiry_1', 'derogatory_public_record_0',
'derogatory_public_record_1'],
dtype='object')
col = ['loan_requested',
'annual_income',
'debt_payment_ratio',
'derogatory_public_record_count',
'total_credit',
'credit_utilization_rate',
'total_credit_line_count',
'issued_month',
'issued_year',
#'credit_earliest_month',
#'credit_earliest_year',
'log_open_credit_line',
'YE_categories_G1',
'YE_categories_G2',
'YE_categories_G3',
'home_ownership_MORTGAGE',
'home_ownership_OTHER',
'home_ownership_OWN',
'home_ownership_RENT',
'loan_grade_A',
'loan_grade_B',
'loan_grade_C',
'loan_grade_D',
'loan_grade_E',
'loan_grade_F',
'loan_grade_G',
'loan_grade_Other',
'income_verification_not verified',
'income_verification_verified',
'initial_status_f',
'initial_status_w',
'loan_category_car',
'loan_category_credit_card',
'loan_category_debt_consolidation',
'loan_category_educational',
'loan_category_home_improvement',
'loan_category_house',
'loan_category_major_purchase',
'loan_category_medical',
'loan_category_moving',
'loan_category_other',
'loan_category_renewable_energy',
'loan_category_small_business',
'loan_category_vacation',
'loan_category_wedding',
'is_36months_0',
'income_over_100000_0',
'income_over_100000_1',
'region_Midwest',
'region_Northeast',
'region_Other',
'region_South',
'region_West',
'has_delinquency_0',
'has_delinquency_1',
'has_inquiry_0',
'has_inquiry_1',
'derogatory_public_record_0',
'derogatory_public_record_1']
test_data_df = test_data_df[col]
test_data_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 80000 entries, 0 to 79999 Data columns (total 57 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 loan_requested 80000 non-null float64 1 annual_income 80000 non-null float64 2 debt_payment_ratio 80000 non-null float64 3 derogatory_public_record_count 80000 non-null int64 4 total_credit 80000 non-null int64 5 credit_utilization_rate 80000 non-null float64 6 total_credit_line_count 80000 non-null int64 7 issued_month 80000 non-null int64 8 issued_year 80000 non-null int64 9 log_open_credit_line 80000 non-null float64 10 YE_categories_G1 80000 non-null uint8 11 YE_categories_G2 80000 non-null uint8 12 YE_categories_G3 80000 non-null uint8 13 home_ownership_MORTGAGE 80000 non-null uint8 14 home_ownership_OTHER 80000 non-null int64 15 home_ownership_OWN 80000 non-null uint8 16 home_ownership_RENT 80000 non-null uint8 17 loan_grade_A 80000 non-null uint8 18 loan_grade_B 80000 non-null uint8 19 loan_grade_C 80000 non-null uint8 20 loan_grade_D 80000 non-null uint8 21 loan_grade_E 80000 non-null uint8 22 loan_grade_F 80000 non-null uint8 23 loan_grade_G 80000 non-null uint8 24 loan_grade_Other 80000 non-null int64 25 income_verification_not verified 80000 non-null uint8 26 income_verification_verified 80000 non-null uint8 27 initial_status_f 80000 non-null uint8 28 initial_status_w 80000 non-null uint8 29 loan_category_car 80000 non-null uint8 30 loan_category_credit_card 80000 non-null uint8 31 loan_category_debt_consolidation 80000 non-null uint8 32 loan_category_educational 80000 non-null int64 33 loan_category_home_improvement 80000 non-null uint8 34 loan_category_house 80000 non-null uint8 35 loan_category_major_purchase 80000 non-null uint8 36 loan_category_medical 80000 non-null uint8 37 loan_category_moving 80000 non-null uint8 38 loan_category_other 80000 non-null uint8 39 loan_category_renewable_energy 80000 non-null uint8 40 loan_category_small_business 80000 non-null uint8 41 loan_category_vacation 80000 non-null uint8 42 loan_category_wedding 80000 non-null uint8 43 is_36months_0 80000 non-null uint8 44 income_over_100000_0 80000 non-null uint8 45 income_over_100000_1 80000 non-null uint8 46 region_Midwest 80000 non-null uint8 47 region_Northeast 80000 non-null uint8 48 region_Other 80000 non-null uint8 49 region_South 80000 non-null uint8 50 region_West 80000 non-null uint8 51 has_delinquency_0 80000 non-null uint8 52 has_delinquency_1 80000 non-null uint8 53 has_inquiry_0 80000 non-null uint8 54 has_inquiry_1 80000 non-null uint8 55 derogatory_public_record_0 80000 non-null uint8 56 derogatory_public_record_1 80000 non-null uint8 dtypes: float64(5), int64(8), uint8(44) memory usage: 11.3 MB
test_data_df.head()
| loan_requested | annual_income | debt_payment_ratio | derogatory_public_record_count | total_credit | credit_utilization_rate | total_credit_line_count | issued_month | issued_year | log_open_credit_line | YE_categories_G1 | YE_categories_G2 | YE_categories_G3 | home_ownership_MORTGAGE | home_ownership_OTHER | home_ownership_OWN | home_ownership_RENT | loan_grade_A | loan_grade_B | loan_grade_C | loan_grade_D | loan_grade_E | loan_grade_F | loan_grade_G | loan_grade_Other | income_verification_not verified | income_verification_verified | initial_status_f | initial_status_w | loan_category_car | loan_category_credit_card | loan_category_debt_consolidation | loan_category_educational | loan_category_home_improvement | loan_category_house | loan_category_major_purchase | loan_category_medical | loan_category_moving | loan_category_other | loan_category_renewable_energy | loan_category_small_business | loan_category_vacation | loan_category_wedding | is_36months_0 | income_over_100000_0 | income_over_100000_1 | region_Midwest | region_Northeast | region_Other | region_South | region_West | has_delinquency_0 | has_delinquency_1 | has_inquiry_0 | has_inquiry_1 | derogatory_public_record_0 | derogatory_public_record_1 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6000.0 | 68000.0 | 28.31 | 0 | 19861 | 64.5 | 33 | 3 | 2015 | 1.255273 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 |
| 1 | 24000.0 | 110480.0 | 16.03 | 0 | 17001 | 26.2 | 36 | 3 | 2015 | 1.079181 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 |
| 2 | 35000.0 | 86000.0 | 32.49 | 0 | 25797 | 49.9 | 33 | 3 | 2015 | 1.204120 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| 3 | 10000.0 | 30000.0 | 32.96 | 1 | 9586 | 43.8 | 21 | 3 | 2015 | 1.113943 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 |
| 4 | 24000.0 | 82500.0 | 31.03 | 0 | 31842 | 41.3 | 43 | 3 | 2015 | 1.431364 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
loan_data_df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 317177 entries, 0 to 399999 Data columns (total 58 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 interest_rate 317177 non-null float64 1 loan_requested 317177 non-null float64 2 annual_income 317177 non-null float64 3 debt_payment_ratio 317177 non-null float64 4 derogatory_public_record_count 317177 non-null float64 5 total_credit 317177 non-null float64 6 credit_utilization_rate 317177 non-null float64 7 total_credit_line_count 317177 non-null float64 8 issued_month 317177 non-null int64 9 issued_year 317177 non-null int64 10 log_open_credit_line 317177 non-null float64 11 YE_categories_G1 317177 non-null uint8 12 YE_categories_G2 317177 non-null uint8 13 YE_categories_G3 317177 non-null uint8 14 home_ownership_MORTGAGE 317177 non-null uint8 15 home_ownership_OTHER 317177 non-null uint8 16 home_ownership_OWN 317177 non-null uint8 17 home_ownership_RENT 317177 non-null uint8 18 loan_grade_A 317177 non-null uint8 19 loan_grade_B 317177 non-null uint8 20 loan_grade_C 317177 non-null uint8 21 loan_grade_D 317177 non-null uint8 22 loan_grade_E 317177 non-null uint8 23 loan_grade_F 317177 non-null uint8 24 loan_grade_G 317177 non-null uint8 25 loan_grade_Other 317177 non-null uint8 26 income_verification_not verified 317177 non-null uint8 27 income_verification_verified 317177 non-null uint8 28 initial_status_f 317177 non-null uint8 29 initial_status_w 317177 non-null uint8 30 loan_category_car 317177 non-null uint8 31 loan_category_credit_card 317177 non-null uint8 32 loan_category_debt_consolidation 317177 non-null uint8 33 loan_category_educational 317177 non-null uint8 34 loan_category_home_improvement 317177 non-null uint8 35 loan_category_house 317177 non-null uint8 36 loan_category_major_purchase 317177 non-null uint8 37 loan_category_medical 317177 non-null uint8 38 loan_category_moving 317177 non-null uint8 39 loan_category_other 317177 non-null uint8 40 loan_category_renewable_energy 317177 non-null uint8 41 loan_category_small_business 317177 non-null uint8 42 loan_category_vacation 317177 non-null uint8 43 loan_category_wedding 317177 non-null uint8 44 is_36months_0 317177 non-null uint8 45 income_over_100000_0 317177 non-null uint8 46 income_over_100000_1 317177 non-null uint8 47 region_Midwest 317177 non-null uint8 48 region_Northeast 317177 non-null uint8 49 region_Other 317177 non-null uint8 50 region_South 317177 non-null uint8 51 region_West 317177 non-null uint8 52 has_delinquency_0 317177 non-null uint8 53 has_delinquency_1 317177 non-null uint8 54 has_inquiry_0 317177 non-null uint8 55 has_inquiry_1 317177 non-null uint8 56 derogatory_public_record_0 317177 non-null uint8 57 derogatory_public_record_1 317177 non-null uint8 dtypes: float64(9), int64(2), uint8(47) memory usage: 43.3 MB
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)
y_pred_lr = lr_model.predict(test_data_df)
rf = RandomForestRegressor(random_state = 42)
# Fit 'rf' to the training set
rf.fit(X_train, y_train)
# Predict the test set labels 'y_pred_test'
y_pred_test = rf.predict(test_data_df)
test = pd.read_csv('Holdout for Testing.csv')
predict = pd.DataFrame({
"Loan_Id": test["X2"],
"Predicted_Interest_Rate": y_pred_lr
})
predict.to_csv('Loan_Results_from_Maggie_Chuang_LR2.csv', index=False)
test = pd.read_csv('Holdout for Testing.csv')
predict = pd.DataFrame({
"Loan_Id": test["X2"],
"Predicted_Interest_Rate": y_pred_test
})
predict.to_csv('Loan_Results_from_Maggie_Chuang_RF2.csv', index=False)